@pencil-agent/nano-pencil 2.0.0-beta.9 → 2.0.0
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/extensions-host/index.d.ts +1 -1
- package/dist/core/extensions-host/types.d.ts +5 -8
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/goal/goal-controller.js +1 -1
- package/dist/extensions/builtin/goal/goal-prompts.js +4 -4
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/index.d.ts +5 -30
- package/dist/index.js +1 -1
- package/dist/models.d.ts +7 -0
- package/dist/models.js +1 -0
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/packages/protocol/src/flags.d.ts +20 -0
- package/dist/packages/protocol/src/flags.js +0 -0
- package/dist/packages/protocol/src/hooks.d.ts +17 -0
- package/dist/packages/protocol/src/hooks.js +0 -0
- package/dist/packages/protocol/src/index.d.ts +4 -2
- package/dist/packages/protocol/src/index.js +1 -1
- package/dist/packages/protocol/src/lifecycle.d.ts +11 -21
- package/dist/public-config.d.ts +12 -0
- package/dist/public-config.js +1 -0
- package/dist/runtime.d.ts +9 -0
- package/dist/runtime.js +1 -0
- package/dist/session-compaction.d.ts +7 -0
- package/dist/session-compaction.js +1 -0
- package/dist/session.d.ts +7 -0
- package/dist/session.js +1 -0
- package/dist/skills.d.ts +7 -0
- package/dist/skills.js +1 -0
- package/dist/tools.d.ts +7 -0
- package/dist/tools.js +1 -0
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -162
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,418 +1,418 @@
# YouTube - Data Extraction

Field-tested against youtube.com on 2026-04-21.
No authentication required for any approach documented here.

---

## Approach 1 (Fastest): oEmbed API - No Auth, No Browser

`https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={VIDEO_ID}&format=json`

Returns JSON in ~0.3s. Works for any public video. Does **not** require login.

```python
from helpers import http_get
import json

def youtube_oembed(video_id):
    """Fetch oEmbed metadata for a YouTube video.

    Returns title, author, thumbnail URL, and embed iframe HTML.
    """
    url = f"https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={video_id}&format=json"
    return json.loads(http_get(url))

data = youtube_oembed("dQw4w9WgXcQ")
# {
#   "title": "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)",
#   "author_name": "Rick Astley",
#   "author_url": "https://www.youtube.com/@RickAstleyYT",
#   "type": "video",
#   "thumbnail_url": "https://i.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg",
#   "thumbnail_width": 480,
#   "thumbnail_height": 360,
#   "width": 200,
#   "height": 113,
#   "version": "1.0",
#   "provider_name": "YouTube",
#   "html": '<iframe width="200" height="113" src="https://www.youtube.com/embed/dQw4w9WgXcQ?feature=oembed" ...>'
# }
```
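
The oEmbed URL above nests one URL inside another URL's query string. As a minimal sketch, the same request URL can be built with `urllib.parse.urlencode`, so that the inner watch URL is percent-encoded. The name `build_oembed_url` is hypothetical and not part of `helpers`; the encoded form is assumed to be accepted the same way as the unencoded one.

```python
from urllib.parse import urlencode

def build_oembed_url(video_id):
    """Build the oEmbed URL with the nested watch URL percent-encoded.

    Hypothetical helper for illustration; not part of helpers.py.
    """
    watch_url = f"https://www.youtube.com/watch?v={video_id}"
    # urlencode percent-encodes ":", "/", "?" and "=" inside the inner URL
    query = urlencode({"url": watch_url, "format": "json"})
    return f"https://www.youtube.com/oembed?{query}"

print(build_oembed_url("dQw4w9WgXcQ"))
# https://www.youtube.com/oembed?url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DdQw4w9WgXcQ&format=json
```

The result can be passed to `http_get` in place of the hand-built f-string.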

### Bulk oEmbed (ThreadPoolExecutor)

```python
from concurrent.futures import ThreadPoolExecutor
import json
from helpers import http_get

video_ids = ["dQw4w9WgXcQ", "jNQXAC9IVRw", "9bZkp7q19f0"]

def fetch_oembed(vid):
    url = f"https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={vid}&format=json"
    try:
        return json.loads(http_get(url))
    except Exception as e:
        return {"error": str(e), "id": vid}

with ThreadPoolExecutor(max_workers=5) as ex:
    results = list(ex.map(fetch_oembed, video_ids))
# 3 videos: ~4.1s total (YouTube oEmbed is slower than Spotify's; don't use >5 workers)
```

---

## Approach 2: Watch Page - Full Metadata via ytInitialPlayerResponse

Every `youtube.com/watch?v={ID}` page embeds two JSON blobs in the HTML:

- `ytInitialPlayerResponse`: video details, microformat, caption track list
- `ytInitialData`: comments section structure, related videos

### Extract all video metadata

```python
from helpers import http_get
import json, re

def scrape_video(video_id):
    html = http_get(f"https://www.youtube.com/watch?v={video_id}")

    # ---- ytInitialPlayerResponse ----
    m = re.search(r'var ytInitialPlayerResponse = (\{.*?\});(?:var|</script>)', html, re.DOTALL)
    if not m:
        raise ValueError(f"ytInitialPlayerResponse not found for video {video_id}: video may be private, deleted, or region-blocked")
    pr = json.loads(m.group(1))

    # Check playability before parsing
    status = pr.get("playabilityStatus", {}).get("status")
    if status == "LOGIN_REQUIRED":
        raise ValueError(f"Video {video_id} is age-restricted or login-gated (playabilityStatus: LOGIN_REQUIRED)")
    if status == "ERROR":
        reason = pr.get("playabilityStatus", {}).get("reason", "unknown")
        raise ValueError(f"Video {video_id} is unavailable: {reason}")

    vd = pr["videoDetails"]
    mf = pr["microformat"]["playerMicroformatRenderer"]
    caps = pr.get("captions", {}) \
             .get("playerCaptionsTracklistRenderer", {}) \
             .get("captionTracks", [])

    return {
        # Core
        "video_id":     vd["videoId"],
        "title":        vd["title"],
        "author":       vd["author"],
        "channel_id":   vd["channelId"],
        "description":  vd["shortDescription"],
        "duration_s":   int(vd["lengthSeconds"]),
        "view_count":   int(vd["viewCount"]),
        "keywords":     vd.get("keywords", []),
        "is_live":      vd.get("isLiveContent", False),
        "is_private":   vd.get("isPrivate", False),
        # Microformat (richer publishing data)
        "publish_date": mf.get("publishDate"),    # ISO 8601, e.g. "2009-10-25T00:57:33-07:00"
        "upload_date":  mf.get("uploadDate"),
        "category":     mf.get("category"),       # e.g. "Music", "Gaming", "Education"
        "like_count":   int(mf.get("likeCount", 0)),
        "is_family_safe": mf.get("isFamilySafe"),
        "is_unlisted":  mf.get("isUnlisted"),
        "available_countries": mf.get("availableCountries", []),  # list of ISO 3166-1 alpha-2 codes
        "channel_name": mf.get("ownerChannelName"),
        "channel_url":  mf.get("ownerProfileUrl"),
        "embed_url":    mf.get("embed", {}).get("iframeUrl"),
        # Thumbnails (all publicly accessible, no auth)
        "thumbnail_hq":  f"https://i.ytimg.com/vi/{video_id}/hqdefault.jpg",      # 480×360, always exists
        "thumbnail_max": f"https://i.ytimg.com/vi/{video_id}/maxresdefault.jpg",  # 1280×720, may 404
        # Caption tracks: baseUrl is included for reference but returns empty in practice;
        # use the Show Transcript UI flow in the browser instead (see playback.md)
        "caption_tracks": [
            {
                "lang":     t.get("languageCode"),
                "name":     t.get("name", {}).get("simpleText"),
                "kind":     t.get("kind", "manual"),   # "manual" or "asr" (auto-generated)
                "base_url": t.get("baseUrl"),
            }
            for t in caps
        ],
    }

video = scrape_video("dQw4w9WgXcQ")
# {
#   "video_id": "dQw4w9WgXcQ",
#   "title": "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)",
#   "author": "Rick Astley",
#   "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
#   "duration_s": 213,
#   "view_count": 1764468859,
#   "publish_date": "2009-10-24T23:57:33-07:00",
#   "category": "Music",
#   "like_count": 16942558,
#   "caption_tracks": [
#     {"lang": "en", "name": "English", "kind": "manual"},
#     {"lang": "en", "name": "English (auto-generated)", "kind": "asr"},
#     {"lang": "de-DE", "name": "German (Germany)", "kind": "manual"},
#     ...
#   ],
# }
```
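
The non-greedy regex in `scrape_video()` stops at the first `};var` or `};</script>`, which could in principle occur inside a JSON string value. One possible alternative, sketched here under that assumption, is to let the standard library's `json.JSONDecoder.raw_decode` find the end of the object. The helper name `extract_json_blob` and the sample HTML are made up for illustration.

```python
import json

def extract_json_blob(html, var_name):
    """Return the JSON object assigned to `var <var_name> = {...}` in html, or None.

    Hypothetical helper: the JSON parser locates the end of the object,
    instead of a non-greedy regex guessing where it ends.
    """
    marker = f"var {var_name} = "
    start = html.find(marker)
    if start == -1:
        return None
    try:
        # raw_decode parses exactly one JSON value starting at the given
        # index and ignores whatever follows it (e.g. ";var ..." or ";</script>").
        obj, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    except json.JSONDecodeError:
        return None
    return obj

# Made-up sample: the title contains "};var", which would cut the regex match short.
sample = '<script>var ytInitialPlayerResponse = {"videoDetails": {"title": "a};var b"}};var x = 1;</script>'
blob = extract_json_blob(sample, "ytInitialPlayerResponse")
print(blob["videoDetails"]["title"])  # a};var b
```

The same helper would apply to `ytInitialData`, assuming the page keeps the `var NAME = {...}` assignment form.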

---

## Approach 3: Search Results - No Auth

`youtube.com/results?search_query={QUERY}` is server-side rendered. The `ytInitialData` blob contains up to ~20 video results.

```python
from helpers import http_get
import json, re
from urllib.parse import quote_plus

def youtube_search(query, max_results=20):
    """Search YouTube videos without a browser or API key."""
    url = f"https://www.youtube.com/results?search_query={quote_plus(query)}"
    html = http_get(url)

    m = re.search(r'var ytInitialData = (\{.*?\});</script>', html, re.DOTALL)
    if not m:
        raise ValueError("ytInitialData not found in search results page")
    data = json.loads(m.group(1))

    # Walk the nested structure to find videoRenderer items
    section_contents = (
        data.get("contents", {})
            .get("twoColumnSearchResultsRenderer", {})
            .get("primaryContents", {})
            .get("sectionListRenderer", {})
            .get("contents", [])
    )

    results = []
    for section in section_contents:
        for item in section.get("itemSectionRenderer", {}).get("contents", []):
            vr = item.get("videoRenderer", {})
            if not vr:
                continue
            snippet = vr.get("detailedMetadataSnippets", [])
            desc = (
                "".join(r.get("text", "") for r in snippet[0]["snippetText"]["runs"])
                if snippet else None
            )
            results.append({
                "video_id":    vr["videoId"],
                "url":         f"https://www.youtube.com/watch?v={vr['videoId']}",
                "title":       vr.get("title", {}).get("runs", [{}])[0].get("text"),
                "channel":     vr.get("ownerText", {}).get("runs", [{}])[0].get("text"),
                "channel_url": (
                    "https://www.youtube.com"
                    + vr.get("ownerText", {}).get("runs", [{}])[0]
                        .get("navigationEndpoint", {})
                        .get("browseEndpoint", {})
                        .get("canonicalBaseUrl", "")
                ),
                "duration":    vr.get("lengthText", {}).get("simpleText"),         # e.g. "3:32"
                "views":       vr.get("viewCountText", {}).get("simpleText"),      # e.g. "1,764,468,859 views"
                "published":   vr.get("publishedTimeText", {}).get("simpleText"),  # e.g. "7 years ago"
                "description_snippet": desc,
                "thumbnail":   f"https://i.ytimg.com/vi/{vr['videoId']}/hqdefault.jpg",
            })
            if len(results) >= max_results:
                return results  # exit both loops immediately
    return results

results = youtube_search("python tutorial", max_results=5)
# With the default max_results=20, expect ~14-20 results (YouTube serves fewer than 20 on first page)
# [
#   {
#     "video_id": "K5KVEU3aaeQ",
#     "title": "Python Full Course for Beginners",
#     "channel": "Programming with Mosh",
#     "duration": "2:02:21",
#     "views": "6,056,121 views",
#     "published": "1 year ago",
#   }, ...
# ]
```
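
`youtube_search()` returns `duration` and `views` as display strings. A small sketch of converting them to numbers follows, assuming the English formats shown in the sample output (`"2:02:21"`, `"6,056,121 views"`). Both function names are hypothetical, and anything in another format (abbreviated counts, other locales, live badges) deliberately yields `None`.

```python
import re

def parse_view_count(text):
    """Convert e.g. "1,764,468,859 views" to 1764468859; None if not in that form."""
    if not text:
        return None
    m = re.match(r"(\d[\d,]*)\s+views?$", text.strip())
    return int(m.group(1).replace(",", "")) if m else None

def parse_duration(text):
    """Convert "3:32" or "2:02:21" to seconds; None if missing or malformed."""
    if not text or not re.fullmatch(r"[0-9]+(?::[0-9]{1,2}){0,2}", text):
        return None
    seconds = 0
    for part in text.split(":"):
        # Each further field is worth 60x less than the one before it
        seconds = seconds * 60 + int(part)
    return seconds

print(parse_view_count("6,056,121 views"))  # 6056121
print(parse_duration("2:02:21"))            # 7341
```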

---

## Approach 4: Channel Metadata - No Auth

Channel pages (`youtube.com/@handle` or `youtube.com/channel/{CHANNEL_ID}`) embed metadata in `ytInitialData`.

```python
from helpers import http_get
import json, re

def scrape_channel(handle_or_id):
    """
    handle_or_id: "@RickAstleyYT"  (handle, with @)
                  "UCuAXFkgsw1L7xaCfnd5JJOw"  (channel ID)
    """
    if handle_or_id.startswith("UC"):
        url = f"https://www.youtube.com/channel/{handle_or_id}"
    else:
        url = f"https://www.youtube.com/{handle_or_id}"

    html = http_get(url)
    m = re.search(r'var ytInitialData = (\{.*?\});</script>', html, re.DOTALL)
    if not m:
        raise ValueError(f"ytInitialData not found for channel {handle_or_id}")
    data = json.loads(m.group(1))

    # Canonical metadata (always present)
    cmd = data.get("metadata", {}).get("channelMetadataRenderer", {})

    # Subscriber count + handle from pageHeaderViewModel
    ph = (
        data.get("header", {})
            .get("pageHeaderRenderer", {})
            .get("content", {})
            .get("pageHeaderViewModel", {})
    )
    meta_parts = [
        part.get("text", {}).get("content", "")
        for row in ph.get("metadata", {})
                     .get("contentMetadataViewModel", {})
                     .get("metadataRows", [])
        for part in row.get("metadataParts", [])
    ]
    # meta_parts is typically: ["@handle", "4.48m subscribers", "N videos"]

    # Avatar (take the largest source)
    avatar_sources = (
        ph.get("image", {})
          .get("decoratedAvatarViewModel", {})
          .get("avatar", {})
          .get("avatarViewModel", {})
          .get("image", {})
          .get("sources", [])
    )
    avatar_url = avatar_sources[-1]["url"] if avatar_sources else None

    # Channel banner
    banner_sources = (
        ph.get("banner", {})
          .get("imageBannerViewModel", {})
          .get("image", {})
          .get("sources", [])
    )
    banner_url = banner_sources[-1]["url"] if banner_sources else None

    return {
        "channel_id":  cmd.get("externalId"),
        "title":       cmd.get("title"),
        "description": cmd.get("description"),
        "channel_url": cmd.get("channelUrl"),
        "keywords":    cmd.get("keywords", ""),
        "handle":      meta_parts[0] if len(meta_parts) > 0 else None,
        "subscribers": meta_parts[1] if len(meta_parts) > 1 else None,  # e.g. "4.48m subscribers"
        "avatar_url":  avatar_url,
        "banner_url":  banner_url,
    }

channel = scrape_channel("@RickAstleyYT")
# {
#   "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
#   "title": "Rick Astley",
#   "description": "2026 UK & Ireland Reflection Tour...",
#   "channel_url": "https://www.youtube.com/channel/UCuAXFkgsw1L7xaCfnd5JJOw",
#   "handle": "@RickAstleyYT",
#   "subscribers": "4.48m subscribers",
#   "avatar_url": "https://yt3.googleusercontent.com/...",
#   "banner_url": "https://yt3.googleusercontent.com/...",
# }
```
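
The `subscribers` field comes back as display text such as `"4.48m subscribers"`. A rough sketch of turning it into an integer follows, assuming English text and `k`/`m`/`b` suffixes in either case. The name `parse_subscriber_count` is hypothetical, and the result is only as precise as the rounded figure YouTube displays.

```python
import re
from decimal import Decimal

# Multipliers for the abbreviated suffixes (assumed set; matched case-insensitively)
_SUFFIXES = {"": 1, "k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def parse_subscriber_count(text):
    """Convert e.g. "4.48m subscribers" to 4480000; None if not recognised."""
    if not text:
        return None
    m = re.match(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([kmb]?)\s+subscribers?\b", text, re.IGNORECASE)
    if not m:
        return None
    number, suffix = m.group(1), m.group(2).lower()
    # Decimal avoids float rounding (4.48 * 1e6 is not exact in binary floating point)
    return int(Decimal(number) * _SUFFIXES[suffix])

print(parse_subscriber_count("4.48m subscribers"))  # 4480000
```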

---

## Thumbnail URLs - All Sizes

All thumbnail sizes are publicly accessible without auth. Construct directly from `video_id`:

```python
def thumbnail_urls(video_id):
    base = f"https://i.ytimg.com/vi/{video_id}"
    return {
        "default":  f"{base}/default.jpg",        # 120×90,   always exists
        "medium":   f"{base}/mqdefault.jpg",      # 320×180,  always exists
        "high":     f"{base}/hqdefault.jpg",      # 480×360,  always exists
        "standard": f"{base}/sddefault.jpg",      # 640×480,  always exists
        "maxres":   f"{base}/maxresdefault.jpg",  # 1280×720, may 404 on older/low-res videos
    }
```
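
Because `maxresdefault.jpg` may 404, a caller that wants the largest available image needs a fallback. The sketch below assumes the caller supplies an `exists(url)` check (hypothetical; for example a HEAD request that returns `True` on HTTP 200), which keeps the selection logic independent of any particular HTTP helper.

```python
def best_thumbnail(video_id, exists):
    """Return the largest thumbnail URL for which exists(url) is truthy.

    `exists` is a caller-supplied callable (hypothetical). Falls back to
    hqdefault.jpg if every check fails.
    """
    base = f"https://i.ytimg.com/vi/{video_id}"
    # Ordered largest first, using the file names listed above
    for name in ("maxresdefault", "sddefault", "hqdefault"):
        url = f"{base}/{name}.jpg"
        if exists(url):
            return url
    return f"{base}/hqdefault.jpg"

# Offline demo: pretend maxresdefault.jpg returns 404
print(best_thumbnail("dQw4w9WgXcQ", lambda u: "maxresdefault" not in u))
# https://i.ytimg.com/vi/dQw4w9WgXcQ/sddefault.jpg
```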

---

## Extract Video ID from Any URL

```python
import re

def extract_video_id(url):
    """Extract YouTube video ID (11-char) from any YouTube URL format."""
    m = re.search(r'(?:v=|/v/|youtu\.be/|/embed/|/shorts/)([A-Za-z0-9_-]{11})', url)
    return m.group(1) if m else None

extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ")  # "dQw4w9WgXcQ"
extract_video_id("https://youtu.be/dQw4w9WgXcQ")                 # "dQw4w9WgXcQ"
extract_video_id("https://www.youtube.com/shorts/dQw4w9WgXcQ")   # "dQw4w9WgXcQ"
extract_video_id("https://www.youtube.com/embed/dQw4w9WgXcQ")    # "dQw4w9WgXcQ"
```
|
|
359
|
-
|
|
360
|
-
---
|
|
361
|
-
|
|
362
|
-
## What Requires a Browser
|
|
363
|
-
|
|
364
|
-
The following are **not accessible** via `http_get` and require the CDP browser (see `playback.md`):
|
|
365
|
-
|
|
366
|
-
- **Trending / Explore** (`/feed/trending`) — `ytInitialData` loads but video items are empty without cookies
|
|
367
|
-
- **Playlist contents** — `ytInitialData` returns only microformat; full video list requires session cookies
|
|
368
|
-
- **Comments** — loaded lazily via XHR, not present in initial HTML
|
|
369
|
-
- **Shorts feed** — requires JS hydration
|
|
370
|
-
- **Channel Videos tab** — video list requires cookies for consistent results
|
|
371
|
-
- **Caption text content** — `captionTracks[].baseUrl` URLs return empty bytes regardless of session state; use the browser's Show Transcript UI flow instead (see `playback.md`)
|
|
372
|
-
- **Age-restricted videos** — oEmbed returns HTTP 401; `scrape_video()` raises a `ValueError` whose message contains `LOGIN_REQUIRED`
|
|
373
|
-
|
|
374
|
-
### Watch-page DOM hydration — the wait you need
|
|
375
|
-
|
|
376
|
-
When you do fall through to the browser for watch-page DOM (e.g. because you need a
|
|
377
|
-
rendered UI state, not just metadata), `wait_for_load()` is **not** enough. The `load`
|
|
378
|
-
event fires before YouTube's Polymer components hydrate — `h1.ytd-watch-metadata yt-formatted-string`,
|
|
379
|
-
`ytd-video-owner-renderer #channel-name a`, and `ytd-watch-info-text` all return `null` for
|
|
380
|
-
~2s after load. Add a `wait(3)` after `wait_for_load()` before querying any watch-page
|
|
381
|
-
selector.
|
|
382
|
-
|
|
383
|
-
Field-tested 2026-04-24 on Brave; same behavior observed on ungoogled-chromium. Prefer
|
|
384
|
-
the `http_get` + `ytInitialPlayerResponse` path above — the browser path exists for flows
|
|
385
|
-
that need live UI state, not for reading metadata.
|
|
386
|
-
|
|
387
|
-
---
|
|
388
|
-
|
|
389
|
-
## URL Patterns
|
|
390
|
-
|
|
391
|
-
| Resource | URL pattern |
|
|
392
|
-
|----------------|--------------------------------------------------------------------|
|
|
393
|
-
| Video | `https://www.youtube.com/watch?v={VIDEO_ID}` |
|
|
394
|
-
| Short URL | `https://youtu.be/{VIDEO_ID}` |
|
|
395
|
-
| Shorts | `https://www.youtube.com/shorts/{VIDEO_ID}` |
|
|
396
|
-
| Channel handle | `https://www.youtube.com/@{HANDLE}` |
|
|
397
|
-
| Channel ID | `https://www.youtube.com/channel/{CHANNEL_ID}` |
|
|
398
|
-
| Playlist | `https://www.youtube.com/playlist?list={PLAYLIST_ID}` |
|
|
399
|
-
| Search | `https://www.youtube.com/results?search_query={QUERY}` |
|
|
400
|
-
| oEmbed | `https://www.youtube.com/oembed?url={VIDEO_URL}&format=json` |
|
|
401
|
-
| Thumbnail (HQ) | `https://i.ytimg.com/vi/{VIDEO_ID}/hqdefault.jpg` |
|
|
402
|
-
|
|
403
|
-
---
|
|
404
|
-
|
|
405
|
-
## Gotchas
|
|
406
|
-
|
|
407
|
-
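- **`ytInitialPlayerResponse` regex must use a non-greedy match with an explicit terminator** — `(\{.*?\});(?:var|</script>)` with `re.DOTALL` is reliable; the trailing `(?:var|</script>)` is a non-capturing group that anchors the end of the blob, not a lookahead. Do not use `\{.*\}` (greedy), which consumes the entire rest of the page.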
|
|
408
|
-
- **`viewCount` and `lengthSeconds` are strings, not ints** — `vd["viewCount"]` returns `"1764468859"`. Always cast with `int()`.
|
|
409
|
-
- **`likeCount` lives in `microformat`, not `videoDetails`** — `videoDetails` does not expose like count. `microformat.playerMicroformatRenderer.likeCount` is a string integer.
|
|
410
|
-
- **`availableCountries` is a list of ISO 3166-1 alpha-2 codes** — 249 entries for globally available videos. An empty list means region data is unavailable, not that the video is globally blocked.
|
|
411
|
-
- **oEmbed thumbnail is always `hqdefault` (480×360)** — if you need 1280×720, construct the `maxresdefault.jpg` URL directly, but check for 404 on older videos.
|
|
412
|
-
- **Search returns ~14–20 results** — YouTube does not guarantee 20 results. Always iterate `itemSectionRenderer.contents` rather than assuming a fixed count.
|
|
413
|
-
- **Channel subscriber count is a rounded string** — `"4.48m subscribers"`, not an integer. Parse with regex if sorting: `re.search(r'([\d.]+)\s*([km]?)', text, re.I)`.
|
|
414
|
-
- **`meta_parts` order is `[handle, subscribers, video_count]`** — the third element is not always present. Index defensively.
|
|
415
|
-
- **Caption `baseUrl` is not fetchable** — `captionTracks[].baseUrl` contains `expire=` and `signature=` params but returns empty bytes in all tested conditions (plain `http_get`, XHR from within the page, and `fetch()` with cookies). Use the Show Transcript UI in the browser for caption text (see `playback.md`).
|
|
416
|
-
- **Age-restricted videos** — `scrape_video()` raises `ValueError` with a `LOGIN_REQUIRED` message. `oEmbed` returns HTTP 401 (raises `urllib.error.HTTPError`). Neither approach can access age-restricted content without login.
|
|
417
|
-
- **Private / deleted videos** — oEmbed returns HTTP 404 (raises `urllib.error.HTTPError`). Wrap in `try/except`.
|
|
418
|
-
- **`ytInitialData` blob terminator is `;</script>`** — using `re.DOTALL` with `(\{.*?\});</script>` is safe; the blob does not contain `;</script>` internally.
|
|
1
|
+
# YouTube — Data Extraction
|
|
2
|
+
|
|
3
|
+
Field-tested against youtube.com on 2026-04-21.
|
|
4
|
+
No authentication required for any approach documented here.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Approach 1 (Fastest): oEmbed API — No Auth, No Browser
|
|
9
|
+
|
|
10
|
+
`https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={VIDEO_ID}&format=json`
|
|
11
|
+
|
|
12
|
+
Returns JSON in ~0.3s. Works for any public video. Does **not** require login.
|
|
13
|
+
|
|
14
|
+
```python
|
|
15
|
+
from helpers import http_get
|
|
16
|
+
import json
|
|
17
|
+
|
|
18
|
+
def youtube_oembed(video_id):
|
|
19
|
+
"""Fetch oEmbed metadata for a YouTube video.
|
|
20
|
+
|
|
21
|
+
Returns title, author, thumbnail URL, and embed iframe HTML.
|
|
22
|
+
"""
|
|
23
|
+
url = f"https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={video_id}&format=json"
|
|
24
|
+
return json.loads(http_get(url))
|
|
25
|
+
|
|
26
|
+
data = youtube_oembed("dQw4w9WgXcQ")
|
|
27
|
+
# {
|
|
28
|
+
# "title": "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)",
|
|
29
|
+
# "author_name": "Rick Astley",
|
|
30
|
+
# "author_url": "https://www.youtube.com/@RickAstleyYT",
|
|
31
|
+
# "type": "video",
|
|
32
|
+
# "thumbnail_url": "https://i.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg",
|
|
33
|
+
# "thumbnail_width": 480,
|
|
34
|
+
# "thumbnail_height": 360,
|
|
35
|
+
# "width": 200,
|
|
36
|
+
# "height": 113,
|
|
37
|
+
# "version": "1.0",
|
|
38
|
+
# "provider_name": "YouTube",
|
|
39
|
+
# "html": '<iframe width="200" height="113" src="https://www.youtube.com/embed/dQw4w9WgXcQ?feature=oembed" ...>'
|
|
40
|
+
# }
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
### Bulk oEmbed (ThreadPoolExecutor)
|
|
44
|
+
|
|
45
|
+
```python
|
|
46
|
+
from concurrent.futures import ThreadPoolExecutor
|
|
47
|
+
import json
|
|
48
|
+
from helpers import http_get
|
|
49
|
+
|
|
50
|
+
video_ids = ["dQw4w9WgXcQ", "jNQXAC9IVRw", "9bZkp7q19f0"]
|
|
51
|
+
|
|
52
|
+
def fetch_oembed(vid):
|
|
53
|
+
url = f"https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={vid}&format=json"
|
|
54
|
+
try:
|
|
55
|
+
return json.loads(http_get(url))
|
|
56
|
+
except Exception as e:
|
|
57
|
+
return {"error": str(e), "id": vid}
|
|
58
|
+
|
|
59
|
+
with ThreadPoolExecutor(max_workers=5) as ex:
|
|
60
|
+
results = list(ex.map(fetch_oembed, video_ids))
|
|
61
|
+
# 3 videos: ~4.1s total (YouTube oEmbed is slower than Spotify's; don't use >5 workers)
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
---
|
|
65
|
+
|
|
66
|
+
## Approach 2: Watch Page — Full Metadata via ytInitialPlayerResponse
|
|
67
|
+
|
|
68
|
+
Every `youtube.com/watch?v={ID}` page embeds two JSON blobs in the HTML:
|
|
69
|
+
|
|
70
|
+
- `ytInitialPlayerResponse` — video details, microformat, caption track list
|
|
71
|
+
- `ytInitialData` — comments section structure, related videos
|
|
72
|
+
|
|
73
|
+
### Extract all video metadata
|
|
74
|
+
|
|
75
|
+
```python
|
|
76
|
+
from helpers import http_get
|
|
77
|
+
import json, re
|
|
78
|
+
|
|
79
|
+
def scrape_video(video_id):
|
|
80
|
+
html = http_get(f"https://www.youtube.com/watch?v={video_id}")
|
|
81
|
+
|
|
82
|
+
# ---- ytInitialPlayerResponse ----
|
|
83
|
+
m = re.search(r'var ytInitialPlayerResponse = (\{.*?\});(?:var|</script>)', html, re.DOTALL)
|
|
84
|
+
if not m:
|
|
85
|
+
raise ValueError(f"ytInitialPlayerResponse not found for video {video_id} — video may be private, deleted, or region-blocked")
|
|
86
|
+
pr = json.loads(m.group(1))
|
|
87
|
+
|
|
88
|
+
# Check playability before parsing
|
|
89
|
+
status = pr.get("playabilityStatus", {}).get("status")
|
|
90
|
+
if status == "LOGIN_REQUIRED":
|
|
91
|
+
raise ValueError(f"Video {video_id} is age-restricted or login-gated (playabilityStatus: LOGIN_REQUIRED)")
|
|
92
|
+
if status == "ERROR":
|
|
93
|
+
reason = pr.get("playabilityStatus", {}).get("reason", "unknown")
|
|
94
|
+
raise ValueError(f"Video {video_id} is unavailable: {reason}")
|
|
95
|
+
|
|
96
|
+
vd = pr["videoDetails"]
|
|
97
|
+
mf = pr["microformat"]["playerMicroformatRenderer"]
|
|
98
|
+
caps = pr.get("captions", {}) \
|
|
99
|
+
.get("playerCaptionsTracklistRenderer", {}) \
|
|
100
|
+
.get("captionTracks", [])
|
|
101
|
+
|
|
102
|
+
return {
|
|
103
|
+
# Core
|
|
104
|
+
"video_id": vd["videoId"],
|
|
105
|
+
"title": vd["title"],
|
|
106
|
+
"author": vd["author"],
|
|
107
|
+
"channel_id": vd["channelId"],
|
|
108
|
+
"description": vd["shortDescription"],
|
|
109
|
+
"duration_s": int(vd["lengthSeconds"]),
|
|
110
|
+
"view_count": int(vd["viewCount"]),
|
|
111
|
+
"keywords": vd.get("keywords", []),
|
|
112
|
+
"is_live": vd.get("isLiveContent", False),
|
|
113
|
+
"is_private": vd.get("isPrivate", False),
|
|
114
|
+
# Microformat (richer publishing data)
|
|
115
|
+
"publish_date": mf.get("publishDate"), # ISO 8601, e.g. "2009-10-25T00:57:33-07:00"
|
|
116
|
+
"upload_date": mf.get("uploadDate"),
|
|
117
|
+
"category": mf.get("category"), # e.g. "Music", "Gaming", "Education"
|
|
118
|
+
"like_count": int(mf.get("likeCount", 0)),
|
|
119
|
+
"is_family_safe": mf.get("isFamilySafe"),
|
|
120
|
+
"is_unlisted": mf.get("isUnlisted"),
|
|
121
|
+
"available_countries": mf.get("availableCountries", []), # list of ISO 3166-1 alpha-2 codes
|
|
122
|
+
"channel_name": mf.get("ownerChannelName"),
|
|
123
|
+
"channel_url": mf.get("ownerProfileUrl"),
|
|
124
|
+
"embed_url": mf.get("embed", {}).get("iframeUrl"),
|
|
125
|
+
# Thumbnails (all publicly accessible, no auth)
|
|
126
|
+
"thumbnail_hq": f"https://i.ytimg.com/vi/{video_id}/hqdefault.jpg", # 480×360, always exists
|
|
127
|
+
"thumbnail_max": f"https://i.ytimg.com/vi/{video_id}/maxresdefault.jpg", # 1280×720, may 404
|
|
128
|
+
# Caption tracks — baseUrl is included for reference but returns empty in practice;
|
|
129
|
+
# use the Show Transcript UI flow in the browser instead (see playback.md)
|
|
130
|
+
"caption_tracks": [
|
|
131
|
+
{
|
|
132
|
+
"lang": t.get("languageCode"),
|
|
133
|
+
"name": t.get("name", {}).get("simpleText"),
|
|
134
|
+
"kind": t.get("kind", "manual"), # "manual" or "asr" (auto-generated)
|
|
135
|
+
"base_url": t.get("baseUrl"),
|
|
136
|
+
}
|
|
137
|
+
for t in caps
|
|
138
|
+
],
|
|
139
|
+
}
|
|
140
|
+
|
|
141
|
+
video = scrape_video("dQw4w9WgXcQ")
|
|
142
|
+
# {
|
|
143
|
+
# "video_id": "dQw4w9WgXcQ",
|
|
144
|
+
# "title": "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)",
|
|
145
|
+
# "author": "Rick Astley",
|
|
146
|
+
# "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
|
|
147
|
+
# "duration_s": 213,
|
|
148
|
+
# "view_count": 1764468859,
|
|
149
|
+
# "publish_date": "2009-10-24T23:57:33-07:00",
|
|
150
|
+
# "category": "Music",
|
|
151
|
+
# "like_count": 16942558,
|
|
152
|
+
# "caption_tracks": [
|
|
153
|
+
# {"lang": "en", "name": "English", "kind": "manual"},
|
|
154
|
+
# {"lang": "en", "name": "English (auto-generated)", "kind": "asr"},
|
|
155
|
+
# {"lang": "de-DE", "name": "German (Germany)", "kind": "manual"},
|
|
156
|
+
# ...
|
|
157
|
+
# ],
|
|
158
|
+
# }
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
---
|
|
162
|
+
|
|
163
|
+
## Approach 3: Search Results — No Auth
|
|
164
|
+
|
|
165
|
+
`youtube.com/results?search_query={QUERY}` is server-side rendered. The `ytInitialData` blob contains up to ~20 video results.
|
|
166
|
+
|
|
167
|
+
```python
|
|
168
|
+
from helpers import http_get
|
|
169
|
+
import json, re
|
|
170
|
+
from urllib.parse import quote_plus
|
|
171
|
+
|
|
172
|
+
def youtube_search(query, max_results=20):
|
|
173
|
+
"""Search YouTube videos without a browser or API key."""
|
|
174
|
+
url = f"https://www.youtube.com/results?search_query={quote_plus(query)}"
|
|
175
|
+
html = http_get(url)
|
|
176
|
+
|
|
177
|
+
m = re.search(r'var ytInitialData = (\{.*?\});</script>', html, re.DOTALL)
|
|
178
|
+
data = json.loads(m.group(1)) if m else {}  # no blob found (e.g. consent page or layout change): fall through to empty results
|
|
179
|
+
|
|
180
|
+
# Walk the nested structure to find videoRenderer items
|
|
181
|
+
section_contents = (
|
|
182
|
+
data.get("contents", {})
|
|
183
|
+
.get("twoColumnSearchResultsRenderer", {})
|
|
184
|
+
.get("primaryContents", {})
|
|
185
|
+
.get("sectionListRenderer", {})
|
|
186
|
+
.get("contents", [])
|
|
187
|
+
)
|
|
188
|
+
|
|
189
|
+
results = []
|
|
190
|
+
for section in section_contents:
|
|
191
|
+
for item in section.get("itemSectionRenderer", {}).get("contents", []):
|
|
192
|
+
vr = item.get("videoRenderer", {})
|
|
193
|
+
if not vr:
|
|
194
|
+
continue
|
|
195
|
+
snippet = vr.get("detailedMetadataSnippets", [])
|
|
196
|
+
desc = (
|
|
197
|
+
"".join(r.get("text", "") for r in snippet[0]["snippetText"]["runs"])
|
|
198
|
+
if snippet else None
|
|
199
|
+
)
|
|
200
|
+
results.append({
|
|
201
|
+
"video_id": vr["videoId"],
|
|
202
|
+
"url": f"https://www.youtube.com/watch?v={vr['videoId']}",
|
|
203
|
+
"title": vr.get("title", {}).get("runs", [{}])[0].get("text"),
|
|
204
|
+
"channel": vr.get("ownerText", {}).get("runs", [{}])[0].get("text"),
|
|
205
|
+
"channel_url": (
|
|
206
|
+
"https://www.youtube.com"
|
|
207
|
+
+ vr.get("ownerText", {}).get("runs", [{}])[0]
|
|
208
|
+
.get("navigationEndpoint", {})
|
|
209
|
+
.get("browseEndpoint", {})
|
|
210
|
+
.get("canonicalBaseUrl", "")
|
|
211
|
+
),
|
|
212
|
+
"duration": vr.get("lengthText", {}).get("simpleText"), # e.g. "3:32"
|
|
213
|
+
"views": vr.get("viewCountText", {}).get("simpleText"), # e.g. "1,764,468,859 views"
|
|
214
|
+
"published": vr.get("publishedTimeText", {}).get("simpleText"), # e.g. "7 years ago"
|
|
215
|
+
"description_snippet": desc,
|
|
216
|
+
"thumbnail": f"https://i.ytimg.com/vi/{vr['videoId']}/hqdefault.jpg",
|
|
217
|
+
})
|
|
218
|
+
if len(results) >= max_results:
|
|
219
|
+
return results # exit both loops immediately
|
|
220
|
+
return results
|
|
221
|
+
|
|
222
|
+
results = youtube_search("python tutorial", max_results=5)
|
|
223
|
+
# Returns at most max_results items (5 here); uncapped, the first page yields ~14-20 results
|
|
224
|
+
# [
|
|
225
|
+
# {
|
|
226
|
+
# "video_id": "K5KVEU3aaeQ",
|
|
227
|
+
# "title": "Python Full Course for Beginners",
|
|
228
|
+
# "channel": "Programming with Mosh",
|
|
229
|
+
# "duration": "2:02:21",
|
|
230
|
+
# "views": "6,056,121 views",
|
|
231
|
+
# "published": "1 year ago",
|
|
232
|
+
# }, ...
|
|
233
|
+
# ]
|
|
234
|
+
```
|
|
235
|
+
|
|
236
|
+
---
|
|
237
|
+
|
|
238
|
+
## Approach 4: Channel Metadata — No Auth
|
|
239
|
+
|
|
240
|
+
Channel pages (`youtube.com/@handle` or `youtube.com/channel/{CHANNEL_ID}`) embed metadata in `ytInitialData`.
|
|
241
|
+
|
|
242
|
+
```python
|
|
243
|
+
from helpers import http_get
|
|
244
|
+
import json, re
|
|
245
|
+
|
|
246
|
+
def scrape_channel(handle_or_id):
|
|
247
|
+
"""
|
|
248
|
+
handle_or_id: "@RickAstleyYT" (handle, with @)
|
|
249
|
+
"UCuAXFkgsw1L7xaCfnd5JJOw" (channel ID)
|
|
250
|
+
"""
|
|
251
|
+
if handle_or_id.startswith("UC"):
|
|
252
|
+
url = f"https://www.youtube.com/channel/{handle_or_id}"
|
|
253
|
+
else:
|
|
254
|
+
url = f"https://www.youtube.com/{handle_or_id}"
|
|
255
|
+
|
|
256
|
+
html = http_get(url)
|
|
257
|
+
m = re.search(r'var ytInitialData = (\{.*?\});</script>', html, re.DOTALL)
|
|
258
|
+
data = json.loads(m.group(1)) if m else {}  # no blob found (e.g. consent page or layout change): every field falls back to None
|
|
259
|
+
|
|
260
|
+
# Canonical metadata (always present)
|
|
261
|
+
cmd = data.get("metadata", {}).get("channelMetadataRenderer", {})
|
|
262
|
+
|
|
263
|
+
# Subscriber count + handle from pageHeaderViewModel
|
|
264
|
+
ph = (
|
|
265
|
+
data.get("header", {})
|
|
266
|
+
.get("pageHeaderRenderer", {})
|
|
267
|
+
.get("content", {})
|
|
268
|
+
.get("pageHeaderViewModel", {})
|
|
269
|
+
)
|
|
270
|
+
meta_parts = [
|
|
271
|
+
part.get("text", {}).get("content", "")
|
|
272
|
+
for row in ph.get("metadata", {})
|
|
273
|
+
.get("contentMetadataViewModel", {})
|
|
274
|
+
.get("metadataRows", [])
|
|
275
|
+
for part in row.get("metadataParts", [])
|
|
276
|
+
]
|
|
277
|
+
# meta_parts is typically: ["@handle", "4.48m subscribers", "N videos"]
|
|
278
|
+
|
|
279
|
+
# Avatar (take the largest source)
|
|
280
|
+
avatar_sources = (
|
|
281
|
+
ph.get("image", {})
|
|
282
|
+
.get("decoratedAvatarViewModel", {})
|
|
283
|
+
.get("avatar", {})
|
|
284
|
+
.get("avatarViewModel", {})
|
|
285
|
+
.get("image", {})
|
|
286
|
+
.get("sources", [])
|
|
287
|
+
)
|
|
288
|
+
avatar_url = avatar_sources[-1]["url"] if avatar_sources else None
|
|
289
|
+
|
|
290
|
+
# Channel banner
|
|
291
|
+
banner_sources = (
|
|
292
|
+
ph.get("banner", {})
|
|
293
|
+
.get("imageBannerViewModel", {})
|
|
294
|
+
.get("image", {})
|
|
295
|
+
.get("sources", [])
|
|
296
|
+
)
|
|
297
|
+
banner_url = banner_sources[-1]["url"] if banner_sources else None
|
|
298
|
+
|
|
299
|
+
return {
|
|
300
|
+
"channel_id": cmd.get("externalId"),
|
|
301
|
+
"title": cmd.get("title"),
|
|
302
|
+
"description": cmd.get("description"),
|
|
303
|
+
"channel_url": cmd.get("channelUrl"),
|
|
304
|
+
"keywords": cmd.get("keywords", ""),
|
|
305
|
+
"handle": meta_parts[0] if len(meta_parts) > 0 else None,
|
|
306
|
+
"subscribers": meta_parts[1] if len(meta_parts) > 1 else None, # e.g. "4.48m subscribers"
|
|
307
|
+
"avatar_url": avatar_url,
|
|
308
|
+
"banner_url": banner_url,
|
|
309
|
+
}
|
|
310
|
+
|
|
311
|
+
channel = scrape_channel("@RickAstleyYT")
|
|
312
|
+
# {
|
|
313
|
+
# "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
|
|
314
|
+
# "title": "Rick Astley",
|
|
315
|
+
# "description": "2026 UK & Ireland Reflection Tour...",
|
|
316
|
+
# "channel_url": "https://www.youtube.com/channel/UCuAXFkgsw1L7xaCfnd5JJOw",
|
|
317
|
+
# "handle": "@RickAstleyYT",
|
|
318
|
+
# "subscribers": "4.48m subscribers",
|
|
319
|
+
# "avatar_url": "https://yt3.googleusercontent.com/...",
|
|
320
|
+
# "banner_url": "https://yt3.googleusercontent.com/...",
|
|
321
|
+
# }
|
|
322
|
+
```
|
|
323
|
+
|
|
324
|
+
---
|
|
325
|
+
|
|
326
|
+
## Thumbnail URLs — All Sizes
|
|
327
|
+
|
|
328
|
+
All thumbnail sizes are publicly accessible without auth. Construct directly from `video_id`:
|
|
329
|
+
|
|
330
|
+
```python
|
|
331
|
+
def thumbnail_urls(video_id):
|
|
332
|
+
base = f"https://i.ytimg.com/vi/{video_id}"
|
|
333
|
+
return {
|
|
334
|
+
"default": f"{base}/default.jpg", # 120×90, always exists
|
|
335
|
+
"medium": f"{base}/mqdefault.jpg", # 320×180, always exists
|
|
336
|
+
"high": f"{base}/hqdefault.jpg", # 480×360, always exists
|
|
337
|
+
"standard": f"{base}/sddefault.jpg", # 640×480, always exists
|
|
338
|
+
"maxres": f"{base}/maxresdefault.jpg", # 1280×720, may 404 on older/low-res videos
|
|
339
|
+
}
|
|
340
|
+
```
|
|
341
|
+
|
|
342
|
+
---
|
|
343
|
+
|
|
344
|
+
## Extract Video ID from Any URL
|
|
345
|
+
|
|
346
|
+
```python
|
|
347
|
+
import re
|
|
348
|
+
|
|
349
|
+
def extract_video_id(url):
|
|
350
|
+
"""Extract YouTube video ID (11-char) from any YouTube URL format."""
|
|
351
|
+
m = re.search(r'(?:v=|/v/|youtu\.be/|/embed/|/shorts/)([A-Za-z0-9_-]{11})', url)
|
|
352
|
+
return m.group(1) if m else None
|
|
353
|
+
|
|
354
|
+
extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ") # "dQw4w9WgXcQ"
|
|
355
|
+
extract_video_id("https://youtu.be/dQw4w9WgXcQ") # "dQw4w9WgXcQ"
|
|
356
|
+
extract_video_id("https://www.youtube.com/shorts/dQw4w9WgXcQ") # "dQw4w9WgXcQ"
|
|
357
|
+
extract_video_id("https://www.youtube.com/embed/dQw4w9WgXcQ") # "dQw4w9WgXcQ"
|
|
358
|
+
```
|
|
359
|
+
|
|
360
|
+
---
|
|
361
|
+
|
|
362
|
+
## What Requires a Browser
|
|
363
|
+
|
|
364
|
+
The following are **not accessible** via `http_get` and require the CDP browser (see `playback.md`):
|
|
365
|
+
|
|
366
|
+
- **Trending / Explore** (`/feed/trending`) — `ytInitialData` loads but video items are empty without cookies
|
|
367
|
+
- **Playlist contents** — `ytInitialData` returns only microformat; full video list requires session cookies
|
|
368
|
+
- **Comments** — loaded lazily via XHR, not present in initial HTML
|
|
369
|
+
- **Shorts feed** — requires JS hydration
|
|
370
|
+
- **Channel Videos tab** — video list requires cookies for consistent results
|
|
371
|
+
- **Caption text content** — `captionTracks[].baseUrl` URLs return empty bytes regardless of session state; use the browser's Show Transcript UI flow instead (see `playback.md`)
|
|
372
|
+
- **Age-restricted videos** — oEmbed returns HTTP 401; `scrape_video()` raises a `ValueError` whose message contains `LOGIN_REQUIRED`
|
|
373
|
+
|
|
374
|
+
### Watch-page DOM hydration — the wait you need
|
|
375
|
+
|
|
376
|
+
When you do fall through to the browser for watch-page DOM (e.g. because you need a
|
|
377
|
+
rendered UI state, not just metadata), `wait_for_load()` is **not** enough. The `load`
|
|
378
|
+
event fires before YouTube's Polymer components hydrate — `h1.ytd-watch-metadata yt-formatted-string`,
|
|
379
|
+
`ytd-video-owner-renderer #channel-name a`, and `ytd-watch-info-text` all return `null` for
|
|
380
|
+
~2s after load. Add a `wait(3)` after `wait_for_load()` before querying any watch-page
|
|
381
|
+
selector.
|
|
382
|
+
|
|
383
|
+
Field-tested 2026-04-24 on Brave; same behavior observed on ungoogled-chromium. Prefer
|
|
384
|
+
the `http_get` + `ytInitialPlayerResponse` path above — the browser path exists for flows
|
|
385
|
+
that need live UI state, not for reading metadata.
|
|
386
|
+
|
|
387
|
+
---
|
|
388
|
+
|
|
389
|
+
## URL Patterns
|
|
390
|
+
|
|
391
|
+
| Resource | URL pattern |
|
|
392
|
+
|----------------|--------------------------------------------------------------------|
|
|
393
|
+
| Video | `https://www.youtube.com/watch?v={VIDEO_ID}` |
|
|
394
|
+
| Short URL | `https://youtu.be/{VIDEO_ID}` |
|
|
395
|
+
| Shorts | `https://www.youtube.com/shorts/{VIDEO_ID}` |
|
|
396
|
+
| Channel handle | `https://www.youtube.com/@{HANDLE}` |
|
|
397
|
+
| Channel ID | `https://www.youtube.com/channel/{CHANNEL_ID}` |
|
|
398
|
+
| Playlist | `https://www.youtube.com/playlist?list={PLAYLIST_ID}` |
|
|
399
|
+
| Search | `https://www.youtube.com/results?search_query={QUERY}` |
|
|
400
|
+
| oEmbed | `https://www.youtube.com/oembed?url={VIDEO_URL}&format=json` |
|
|
401
|
+
| Thumbnail (HQ) | `https://i.ytimg.com/vi/{VIDEO_ID}/hqdefault.jpg` |
|
|
402
|
+
|
|
403
|
+
---
|
|
404
|
+
|
|
405
|
+
## Gotchas
|
|
406
|
+
|
|
407
|
+
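- **`ytInitialPlayerResponse` regex must use a non-greedy match with an explicit terminator** — `(\{.*?\});(?:var|</script>)` with `re.DOTALL` is reliable; the trailing `(?:var|</script>)` is a non-capturing group that anchors the end of the blob, not a lookahead. Do not use `\{.*\}` (greedy), which consumes the entire rest of the page.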
|
|
408
|
+
- **`viewCount` and `lengthSeconds` are strings, not ints** — `vd["viewCount"]` returns `"1764468859"`. Always cast with `int()`.
|
|
409
|
+
- **`likeCount` lives in `microformat`, not `videoDetails`** — `videoDetails` does not expose like count. `microformat.playerMicroformatRenderer.likeCount` is a string integer.
|
|
410
|
+
- **`availableCountries` is a list of ISO 3166-1 alpha-2 codes** — 249 entries for globally available videos. An empty list means region data is unavailable, not that the video is globally blocked.
|
|
411
|
+
- **oEmbed thumbnail is always `hqdefault` (480×360)** — if you need 1280×720, construct the `maxresdefault.jpg` URL directly, but check for 404 on older videos.
|
|
412
|
+
- **Search returns ~14–20 results** — YouTube does not guarantee 20 results. Always iterate `itemSectionRenderer.contents` rather than assuming a fixed count.
|
|
413
|
+
- **Channel subscriber count is a rounded string** — `"4.48m subscribers"`, not an integer. Parse with regex if sorting: `re.search(r'([\d.]+)\s*([km]?)', text, re.I)`.
|
|
414
|
+
- **`meta_parts` order is `[handle, subscribers, video_count]`** — the third element is not always present. Index defensively.
|
|
415
|
+
- **Caption `baseUrl` is not fetchable** — `captionTracks[].baseUrl` contains `expire=` and `signature=` params but returns empty bytes in all tested conditions (plain `http_get`, XHR from within the page, and `fetch()` with cookies). Use the Show Transcript UI in the browser for caption text (see `playback.md`).
|
|
416
|
+
- **Age-restricted videos** — `scrape_video()` raises `ValueError` with a `LOGIN_REQUIRED` message. `oEmbed` returns HTTP 401 (raises `urllib.error.HTTPError`). Neither approach can access age-restricted content without login.
|
|
417
|
+
- **Private / deleted videos** — oEmbed returns HTTP 404 (raises `urllib.error.HTTPError`). Wrap in `try/except`.
|
|
418
|
+
- **`ytInitialData` blob terminator is `;</script>`** — using `re.DOTALL` with `(\{.*?\});</script>` is safe; the blob does not contain `;</script>` internally.
|
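
The subscriber-count gotcha above suggests parsing the rounded string with a regex before sorting. A minimal sketch of that idea follows. The helper name `parse_subscriber_count` is hypothetical and not part of the skill file. The pattern is a slightly tightened variant of the one in the gotcha (an assumption: it requires a well-formed number and a word boundary after the optional suffix), and only the `k` and `m` suffixes mentioned in the source are handled.

```python
import re
from decimal import Decimal

# Multipliers for the suffixes named in the gotcha; other suffixes are not handled.
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}

# A number, optional whitespace, then an optional k/m suffix that must end a word,
# so the "m" in a word such as "members" is not mistaken for "millions".
_SUBS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([km]?)\b", re.I)


def parse_subscriber_count(text):
    """Convert a string like "4.48m subscribers" to an approximate integer.

    Returns None when the text is empty or contains no number.
    """
    if not text:
        return None
    m = _SUBS_RE.search(text)
    if not m:
        return None
    number, suffix = m.group(1), m.group(2).lower()
    # Decimal avoids binary floating-point rounding on values such as 4.48.
    return int(Decimal(number) * _MULTIPLIERS[suffix])


# Usage: sort channel dicts (shaped like the scrape_channel() output) by audience size.
channels = [
    {"title": "A", "subscribers": "523 subscribers"},
    {"title": "B", "subscribers": "4.48m subscribers"},
    {"title": "C", "subscribers": None},
]
ranked = sorted(
    channels,
    key=lambda c: parse_subscriber_count(c["subscribers"]) or 0,
    reverse=True,
)
```

Because YouTube rounds the displayed figure, the result is only as precise as the source string; it is suitable for ranking, not for exact counts.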