@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. /package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. /package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,349 +1,349 @@
1
- # Letterboxd — Film Data Scraping
2
-
3
- `https://letterboxd.com` — film logging, rating, and review site. Film pages and user profile root pages are publicly accessible via `http_get` (~200–350ms). Most sub-pages (reviews, ratings, user film lists, browse/genre pages) return 403 and require the browser.
4
-
5
- ## Access path decision table
6
-
7
- | Goal | Method | Latency |
8
- |------|--------|---------|
9
- | Film metadata (title, year, director, cast, genres, rating) | `http_get` + JSON-LD | ~200–350ms |
10
- | Film synopsis, poster, OG data | `http_get` + meta tags | same request |
11
- | Film popular reviews (top 12 inline) | `http_get` film page | same request |
12
- | User profile stats (film count, followers) | `http_get` user root | ~150ms |
13
- | Recent global activity stream | `http_get /films/` | ~200ms |
14
- | User watched film list | browser (`/{username}/films/`) | |
15
- | Ratings distribution histogram | browser (`/film/{slug}/ratings/`) | |
16
- | All reviews (paginated) | browser (`/film/{slug}/reviews/`) | |
17
- | Popular / browse / genre film lists | browser (`/films/popular/`, etc.) | |
18
- | Director / actor pages | browser (`/director/{slug}/`, `/actor/{slug}/`) | |
19
- | User diary / lists | browser (`/{username}/diary/`, `/{username}/lists/`) | |
20
-
21
- **Letterboxd's public API** (`api.letterboxd.com/api/v0/`) returns 401 on all endpoints — it requires OAuth2 client credentials (apply at letterboxd.com/api-beta/).
22
-
23
- **Cloudflare Turnstile** is configured in the page JS but is not blocking `http_get` on accessible pages. It only activates on the login form.
24
-
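The decision table above can be condensed into a small routing check. This is a minimal sketch, assuming the 403 pattern described in this file still holds; `needs_browser` is a hypothetical name and is not part of `helpers`. Anything not listed as reachable via `http_get` is treated as browser-only.

```python
from urllib.parse import urlparse

def needs_browser(url):
    """Return True if the table above says this Letterboxd URL needs the browser.

    Hypothetical helper: it only encodes the access pattern observed in this
    document and should be re-checked if Letterboxd changes its rules.
    """
    segments = [s for s in urlparse(url).path.split('/') if s]
    if len(segments) == 1:
        # '/films/' (activity stream) or '/{username}/' (user root page)
        return False
    if len(segments) == 2 and segments[0] == 'film':
        # '/film/{slug}/' film page
        return False
    # Everything else: reviews, ratings, lists, diary, browse, director/actor pages
    return True
```

A caller could use the result to pick between `http_get` and the browser path before making any request.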
25
- ---
26
-
27
- ## Path 1: Film page via http_get (fastest for metadata + ratings)
28
-
29
- Film pages at `letterboxd.com/film/{slug}/` are fully accessible. The JSON-LD block (Movie schema) contains everything you need in one parse.
30
-
31
- **URL slug format:** lowercase title, spaces replaced with hyphens. For disambiguation (same title, different year) append `-{year}`: e.g. `parasite-2019`, `alien-1979`.
32
-
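The slug rule above can be sketched as a small builder. `title_to_slug` is a hypothetical helper, and the punctuation handling is an assumption: the text only states lowercase, hyphens for spaces, and an optional `-{year}` suffix. Non-ASCII characters are simply dropped here, which may not match the real slugs.

```python
import re

def title_to_slug(title, year=None):
    """Build a candidate film slug from a title (hypothetical helper).

    Assumption: punctuation is dropped and runs of spaces or hyphens collapse
    to one hyphen. The document only states lowercase + hyphens + optional year.
    """
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)        # drop punctuation (assumption)
    slug = re.sub(r"[\s-]+", "-", slug).strip("-")  # whitespace runs -> one hyphen
    return f"{slug}-{year}" if year else slug
```

Treat the result as a guess: fetch the film page and confirm the title and year before relying on it.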
33
- ```python
34
- import json, re, html as htmllib
35
- from helpers import http_get
36
-
37
- def extract_film_data(slug):
38
- """
39
- Fetch and parse a Letterboxd film page.
40
- slug examples: 'the-godfather', 'parasite-2019', 'inception', '2001-a-space-odyssey'
41
- """
42
- html = http_get(f"https://letterboxd.com/film/{slug}/")
43
- result = {}
44
-
45
- # --- JSON-LD (primary source) ---
46
- jsonld_raw = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
47
- for block in jsonld_raw:
48
- # Strip CDATA wrapper that Letterboxd wraps around JSON-LD
49
- cleaned = re.sub(r'/\*\s*<!\[CDATA\[.*?\*/\s*', '', block, flags=re.DOTALL)
50
- cleaned = re.sub(r'/\*\s*\]\]>.*?\*/', '', cleaned, flags=re.DOTALL)
51
- try:
52
- data = json.loads(cleaned.strip())
53
- except json.JSONDecodeError:
54
- continue
55
- if data.get('@type') != 'Movie':
56
- continue
57
-
58
- result['title'] = data['name']
59
- result['year'] = data['releasedEvent'][0]['startDate'] if data.get('releasedEvent') else None
60
- result['directors'] = [d['name'] for d in data.get('director', [])]
61
- result['genres'] = data.get('genre', [])
62
- result['countries'] = [c['name'] for c in data.get('countryOfOrigin', [])]
63
- result['studios'] = [s['name'] for s in data.get('productionCompany', [])]
64
- result['actors'] = [a['name'] for a in data.get('actors', [])]
65
- result['poster_url'] = data.get('image')
66
- result['url'] = data.get('url')
67
- r = data.get('aggregateRating', {})
68
- result['rating'] = r.get('ratingValue') # float 0.0–5.0
69
- result['rating_count'] = r.get('ratingCount') # int, total ratings cast
70
- result['review_count'] = r.get('reviewCount') # int, written reviews only
71
-
72
- # --- OG / meta tags (fast fallback, redundant) ---
73
- og = lambda prop: next(iter(re.findall(
74
- rf'<meta[^>]+property="og:{prop}"[^>]+content="([^"]*)"', html)), None)
75
- result['og_title'] = og('title') # includes year: "The Godfather (1972)"
76
- result['synopsis'] = htmllib.unescape(og('description') or '')
77
- result['og_image'] = og('image') # large 1200x675 crop
78
-
79
- # --- Film ID (internal numeric ID) ---
80
- m = re.search(r'data-film-id="(\d+)"', html)
81
- result['film_id'] = m.group(1) if m else None
82
-
83
- # --- Tagline ---
84
- m = re.search(r'<h4 class="tagline">([^<]+)</h4>', html)
85
- result['tagline'] = htmllib.unescape(m.group(1)) if m else None
86
-
87
- # --- Themes (from tab-genres section) ---
88
- m = re.search(r'<h3><span>Themes</span></h3>.*?<p>(.*?)</p>', html, re.DOTALL)
89
- result['themes'] = re.findall(r'class="text-slug">([^<]+)</a>', m.group(1)) if m else []
90
-
91
- # --- Languages ---
92
- result['languages'] = re.findall(r'href="/films/language/[^/]+/"[^>]*>([^<]+)</a>', html)
93
-
94
- # --- Fans count ---
95
- m = re.search(r'class="accessory"[^>]*>\s*([\d,KkMm]+)\s*fans</a>', html)
96
- result['fans'] = m.group(1) if m else None # e.g. "133K"
97
-
98
- # --- Popular reviews (top 12 inline on the page) ---
99
- result['reviews'] = []
100
- for vid, person, block in re.findall(
101
- r'<article class="production-viewing[^"]*"[^>]*data-viewing-id="(\d+)"[^>]*data-person="([^"]+)">(.*?)</article>',
102
- html, re.DOTALL
103
- ):
104
- dm = re.search(r'<strong class="displayname">([^<]+)</strong>', block)
105
- tm = re.search(r'class="body-text -prose -reset[^"]*"[^>]*>(.*?)</div>', block, re.DOTALL)
106
- lm = re.search(r'data-count="(\d+)"', block)
107
- result['reviews'].append({
108
- 'viewing_id': vid,
109
- 'username': person,
110
- 'display_name': dm.group(1) if dm else person,
111
- 'review': re.sub(r'<[^>]+>', '', tm.group(1)).strip() if tm else '',
112
- 'likes': int(lm.group(1)) if lm else 0,
113
- })
114
-
115
- return result
116
- ```
117
-
118
- ### Verified output (2026-04-18)
119
-
120
- ```python
121
- data = extract_film_data('the-godfather')
122
- # {
123
- # 'title': 'The Godfather',
124
- # 'year': '1972',
125
- # 'directors': ['Francis Ford Coppola'],
126
- # 'genres': ['Crime', 'Drama'],
127
- # 'countries': ['USA'],
128
- # 'studios': ['Paramount Pictures', 'Alfran Productions'],
129
- # 'actors': ['Marlon Brando', 'Al Pacino', 'James Caan', ...], # full cast list
130
- # 'rating': 4.52,
131
- # 'rating_count': 2619662,
132
- # 'review_count': 372579,
133
- # 'fans': '133K',
134
- # 'film_id': '51818',
135
- # 'tagline': "An offer you can't refuse.",
137
- # 'themes': ['Crime, drugs and gangsters', 'Gritty crime and ruthless gangsters', ...],
138
- # 'languages': ['English', 'Latin', 'English', 'Italian'], # may have dupes; deduplicate
139
- # 'og_title': 'The Godfather (1972)',
140
- # 'synopsis': 'Spanning the years 1945 to 1955...',
141
- # 'poster_url': 'https://a.ltrbxd.com/resized/film-poster/.../51818-the-godfather-0-230-0-345-crop.jpg...',
142
- # 'og_image': 'https://a.ltrbxd.com/resized/sm/upload/.../the-godfather-1200-1200-675-675-crop-000000.jpg...',
143
- # 'reviews': [
144
- # {'username': 'wizardchurch', 'display_name': 'Hannah', 'likes': 30944,
145
- # 'review': 'haha they made that scene from zootopia into a movie'},
146
- # ... # 12 total
147
- # ]
148
- # }
149
-
150
- data = extract_film_data('parasite-2019')
151
- # title: 'Parasite', year: '2019', rating: 4.53, rating_count: 5264520, review_count: 690652
152
- # fans: '175K', directors: ['Bong Joon Ho'], countries: ['South Korea']
153
-
154
- data = extract_film_data('inception')
155
- # title: 'Inception', year: '2010', rating: 4.23, rating_count: 3913620
156
- ```
157
-
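The `fans` field comes back as a compact string such as `'133K'`, so it cannot be compared or sorted numerically as is. A minimal sketch of a converter follows; `parse_compact_count` is a hypothetical helper, and it assumes `K` means thousand and `M` means million. The result is only as precise as the rounded value the page displays.

```python
def parse_compact_count(text):
    """Convert a compact count such as '133K' or '1.2M' into an int.

    Hypothetical helper; assumes K = thousand and M = million.
    """
    text = text.strip().replace(',', '')
    multipliers = {'K': 1_000, 'M': 1_000_000}
    suffix = text[-1:].upper()
    if suffix in multipliers:
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)
```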
158
- ---
159
-
160
- ## Path 2: User profile via http_get
161
-
162
- Only the user root page `letterboxd.com/{username}/` is accessible. Sub-pages (`/films/`, `/diary/`, `/lists/`) return 403.
163
-
164
- ```python
165
- import re, html as htmllib
166
- from helpers import http_get
167
-
168
- def extract_user_profile(username):
169
- html = http_get(f"https://letterboxd.com/{username}/")
170
-
171
- # Display name
172
- dm = re.search(r'class="displayname tooltip"[^>]*><span class="label">([^<]+)</span>', html)
173
-
174
- # Stats block (Films / This year / Lists / Following / Followers)
175
- stats = re.findall(
176
- r'<span class="value">(\d[\d,]*)</span>'
177
- r'<span class="definition[^"]*">([^<]+)</span>',
178
- html
179
- )
180
-
181
- # Favorites from OG description
182
- od = re.search(r'<meta[^>]+property="og:description"[^>]+content="([^"]*)"', html)
183
- favorites = []
184
- if od:
185
- fm = re.search(r'Favorites:\s*([^.]+)\.', od.group(1))
186
- if fm:
187
- favorites = [f.strip() for f in fm.group(1).split(',')]
188
-
189
- # Film IDs of films shown on profile page (recent activity)
190
- film_ids_on_page = list(set(re.findall(r'data-film-id="(\d+)"', html)))
191
-
192
- return {
193
- 'username': username,
194
- 'display_name': dm.group(1) if dm else None,
195
- 'stats': {label.strip(): int(val.replace(',', '')) for val, label in stats},
196
- 'favorites': favorites,
197
- 'film_ids_on_page': film_ids_on_page,
198
- }
199
- ```
200
-
201
- ### Verified output
202
-
203
- ```python
204
- data = extract_user_profile('dave')
205
- # {
206
- # 'username': 'dave',
207
- # 'display_name': 'Dave Vis',
208
- # 'stats': {'Films': 2553, 'This year': 63, 'Lists': 155, 'Following': 77, 'Followers': 34512},
209
- # 'favorites': ['High and Low (1963)', 'Burning (2018)', 'My Neighbor Totoro (1988)', 'Mulholland Drive (2001)'],
210
- # 'film_ids_on_page': ['51818', '47756', ...] # ~32 film IDs from recent activity blocks
211
- # }
212
- ```
213
-
214
- ---
215
-
216
- ## Path 3: Global activity stream from /films/
217
-
218
- `letterboxd.com/films/` returns the recent global activity feed — approximately 6 full viewing entries, plus many more film slugs from the UI. Use this to discover recently-logged films.
219
-
220
- ```python
221
- import re, html as htmllib
222
- from helpers import http_get
223
-
224
- def extract_activity_stream():
225
- html = http_get("https://letterboxd.com/films/")
226
- entries = []
227
- for owner, obj_id, block in re.findall(
228
- r'class="production-viewing[^"]*"[^>]*data-owner="([^"]+)"[^>]*data-object-id="([^"]+)"[^>]*>(.*?)</article>',
229
- html, re.DOTALL
230
- ):
231
- film_m = re.search(
232
- r'data-item-name="([^"]*)".*?data-item-slug="([^"]*)".*?data-film-id="(\d+)"',
233
- block, re.DOTALL
234
- )
235
- if film_m:
236
- entries.append({
237
- 'owner': owner,
238
- 'film_name': htmllib.unescape(film_m.group(1)),
239
- 'film_slug': film_m.group(2),
240
- 'film_id': film_m.group(3),
241
- })
242
- return entries
243
-
244
- # Returns ~6 entries. Film names are in "Title (Year)" format.
245
- # Example: [{'owner': 'sidduww', 'film_name': 'The Drama (2026)',
246
- # 'film_slug': 'the-drama', 'film_id': '1205494'}, ...]
247
- ```
248
-
249
- ---
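Because the activity stream reports names in "Title (Year)" format, it can be useful to split them before matching against JSON-LD data, where `name` is the bare title. A minimal sketch, with `split_title_year` as a hypothetical helper name:

```python
import re

def split_title_year(film_name):
    """Split 'Title (Year)' into (title, year); year is None if absent."""
    m = re.fullmatch(r'(.*?)\s*\((\d{4})\)', film_name.strip())
    if not m:
        return film_name.strip(), None
    return m.group(1), int(m.group(2))
```

Only a trailing four-digit group in parentheses is treated as the year, so parentheses inside the title itself are left alone.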
250
-
251
- ## Path 4: Browser for list pages and sub-pages (403 via http_get)
252
-
253
- These pages require the browser — use `goto_url()` + `wait_for_load()` + `wait(2)`:
254
-
255
- ```python
256
- from helpers import goto_url, wait_for_load, wait, js
257
- import json
258
-
259
- # Popular films
260
- goto_url("https://letterboxd.com/films/popular/")
261
- wait_for_load()
262
- wait(2)
263
-
264
- films = json.loads(js("""
265
- (function() {
266
- var items = Array.from(document.querySelectorAll('li.film-list-entry, li[class*="poster-container"]'));
267
- return JSON.stringify(items.slice(0, 30).map(function(el) {
268
- var poster = el.querySelector('[data-item-slug]') || el.querySelector('[data-film-slug]');
269
- return {
270
- name: poster ? (poster.dataset.itemName || poster.dataset.filmName) : null,
271
- slug: poster ? (poster.dataset.itemSlug || poster.dataset.filmSlug) : null,
272
- film_id: poster ? poster.dataset.filmId : null
273
- };
274
- }).filter(function(x){ return x.slug; }));
275
- })()
276
- """))
277
-
278
- # User watched films list (paginated, 72/page)
279
- goto_url("https://letterboxd.com/dave/films/")
280
- wait_for_load()
281
- wait(2)
282
-
283
- films = json.loads(js("""
284
- (function() {
285
- var items = Array.from(document.querySelectorAll('li[data-film-id]'));
286
- return JSON.stringify(items.map(function(el) {
287
- return {
288
- film_id: el.dataset.filmId,
289
- film_slug: el.dataset.targetLink ? el.dataset.targetLink.replace(/\\/film\\/|\\/$/g,'') : null,
290
- rating: el.dataset.ownerRating || null
291
- };
292
- }));
293
- })()
294
- """))
295
-
296
- # User diary entries
297
- goto_url("https://letterboxd.com/dave/diary/")
298
- wait_for_load()
299
- wait(2)
300
-
301
- # For paginated browsing, check next page link
302
- next_page_url = js("""
303
- (function() {
304
- var a = document.querySelector('a.next');
305
- return a ? a.href : null;
306
- })()
307
- """)
308
- # Returns URL for next page or null. Load it with goto_url(next_page_url).
309
- ```
310
-
311
- ---
312
-
313
- ## Gotchas
314
-
315
- **JSON-LD is wrapped in CDATA comments** — `json.loads(block)` will fail without stripping the wrapper. Always strip `/* <![CDATA[ */` and `/* ]]> */` first:
316
- ```python
317
- cleaned = re.sub(r'/\*\s*<!\[CDATA\[.*?\*/\s*', '', block, flags=re.DOTALL)
318
- cleaned = re.sub(r'/\*\s*\]\]>.*?\*/', '', cleaned, flags=re.DOTALL)
319
- data = json.loads(cleaned.strip())
320
- ```
321
-
322
- **JSON-LD `name` is bare title, not "Title (Year)"** — `data['name']` returns `'Parasite'`, not `'Parasite (2019)'`. Year is in `data['releasedEvent'][0]['startDate']`. The OG `og:title` meta tag does include the year.
323
-
324
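- **OG description contains HTML entities** — `og:description` and `tagline` use `&#039;` etc. Always unescape them with the standard library's `html.unescape()`; the code above imports the module as `htmllib` because `html` is used as the page-source variable, so call `htmllib.unescape()` there.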
325
-
326
- **`languages` list can have duplicates** — e.g. Parasite returns `['Korean', 'English', 'German', 'Korean']`. Call `list(dict.fromkeys(result['languages']))` to deduplicate while preserving order.
327
-
328
- **Disambiguation slugs** — when two films share a title, Letterboxd appends the year to the slug: `parasite-2019` (Bong's film), vs `parasite` (1982 film). If your slug 404s, try appending `-{year}`.
329
-
330
- **403 pages** — `/film/{slug}/reviews/`, `/film/{slug}/ratings/`, `/film/{slug}/cast/`, `/film/{slug}/details/`, `/{username}/films/`, `/films/popular/`, `/films/by/rating/`, `/genre/{slug}/`, `/director/{slug}/`, `/actor/{slug}/` all return 403 to `http_get`. These require the browser.
331
-
332
- **CSI endpoints are 403** — Letterboxd loads the ratings histogram via `/csi/film/{slug}/rating-histogram/` which returns 403 without a session cookie. Access ratings distribution via browser on `/film/{slug}/ratings/`.
333
-
334
- **`/csi/` and `/ajax/` endpoints need session cookies** — these are used to populate the ratings histogram, friend activity, and popular review sections after page load. Only the inline HTML data (top 12 popular reviews) is available via `http_get`.
335
-
336
- **Cloudflare Turnstile is present but passive** — the `configuration.cloudflare.turnstile` object is in the page JS, but it only activates on the login form. It does not block unauthenticated reads on public film/user pages.
337
-
338
- **The official API requires OAuth** — `api.letterboxd.com/api/v0/` returns 401 on all endpoints. Apply for API access at letterboxd.com/api-beta/ to get client credentials.
339
-
340
- **Fans count is abbreviated** — `'133K'`, `'175K'`. Parse with:
341
- ```python
342
- def parse_abbrev(s):
343
-     s = s.strip().upper()
344
-     if s.endswith('K'): return round(float(s[:-1]) * 1000)     # round() avoids float truncation
345
-     if s.endswith('M'): return round(float(s[:-1]) * 1000000)
346
-     return int(s.replace(',', ''))
347
- ```
348
-
349
- **Film slug from unknown title** — Letterboxd has no public search API. Construct the slug by lowercasing the title and replacing spaces with hyphens, then `http_get` and check for a 403/404 vs a valid JSON-LD block.
1
+ # Letterboxd — Film Data Scraping
2
+
3
+ `https://letterboxd.com` — film logging, rating, and review site. Film pages and user profile root pages are publicly accessible via `http_get` (~200–350ms). Most sub-pages (reviews, ratings, user film lists, browse/genre pages) return 403 and require the browser.
4
+
5
+ ## Access path decision table
6
+
7
+ | Goal | Method | Latency |
8
+ |------|--------|---------|
9
+ | Film metadata (title, year, director, cast, genres, rating) | `http_get` + JSON-LD | ~200–350ms |
10
+ | Film synopsis, poster, OG data | `http_get` + meta tags | same request |
11
+ | Film popular reviews (top 12 inline) | `http_get` film page | same request |
12
+ | User profile stats (film count, followers) | `http_get` user root | ~150ms |
13
+ | Recent global activity stream | `http_get /films/` | ~200ms |
14
+ | User watched film list | browser (`/{username}/films/`) | |
15
+ | Ratings distribution histogram | browser (`/film/{slug}/ratings/`) | |
16
+ | All reviews (paginated) | browser (`/film/{slug}/reviews/`) | |
17
+ | Popular / browse / genre film lists | browser (`/films/popular/`, etc.) | |
18
+ | Director / actor pages | browser (`/director/{slug}/`, `/actor/{slug}/`) | |
19
+ | User diary / lists | browser (`/{username}/diary/`, `/{username}/lists/`) | |
20
+
21
+ **Letterboxd's public API** (`api.letterboxd.com/api/v0/`) returns 401 on all endpoints — it requires OAuth2 client credentials (apply at letterboxd.com/api-beta/).
22
+
23
+ **Cloudflare Turnstile** is configured in the page JS but is not blocking `http_get` on accessible pages. It only activates on the login form.
24
+
25
+ ---
26
+
27
+ ## Path 1: Film page via http_get (fastest for metadata + ratings)
28
+
29
+ Film pages at `letterboxd.com/film/{slug}/` are fully accessible. The JSON-LD block (Movie schema) contains everything you need in one parse.
30
+
31
+ **URL slug format:** lowercase title, spaces replaced with hyphens. For disambiguation (same title, different year) append `-{year}`: e.g. `parasite-2019`, `alien-1979`.
32
+
33
+ ```python
34
+ import json, re, html as htmllib
35
+ from helpers import http_get
36
+
37
+ def extract_film_data(slug):
38
+     """
39
+     Fetch and parse a Letterboxd film page.
40
+     slug examples: 'the-godfather', 'parasite-2019', 'inception', '2001-a-space-odyssey'
41
+     """
42
+     html = http_get(f"https://letterboxd.com/film/{slug}/")
43
+     result = {}
44
+
45
+     # --- JSON-LD (primary source) ---
46
+     jsonld_raw = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
47
+     for block in jsonld_raw:
48
+         # Strip CDATA wrapper that Letterboxd wraps around JSON-LD
49
+         cleaned = re.sub(r'/\*\s*<!\[CDATA\[.*?\*/\s*', '', block, flags=re.DOTALL)
50
+         cleaned = re.sub(r'/\*\s*\]\]>.*?\*/', '', cleaned, flags=re.DOTALL)
51
+         try:
52
+             data = json.loads(cleaned.strip())
53
+         except json.JSONDecodeError:
54
+             continue
55
+         if data.get('@type') != 'Movie':
56
+             continue
57
+
58
+         result['title'] = data['name']
59
+         result['year'] = data['releasedEvent'][0]['startDate'] if data.get('releasedEvent') else None
60
+         result['directors'] = [d['name'] for d in data.get('director', [])]
61
+         result['genres'] = data.get('genre', [])
62
+         result['countries'] = [c['name'] for c in data.get('countryOfOrigin', [])]
63
+         result['studios'] = [s['name'] for s in data.get('productionCompany', [])]
64
+         result['actors'] = [a['name'] for a in data.get('actors', [])]
65
+         result['poster_url'] = data.get('image')
66
+         result['url'] = data.get('url')
67
+         r = data.get('aggregateRating', {})
68
+         result['rating'] = r.get('ratingValue') # float 0.0–5.0
69
+         result['rating_count'] = r.get('ratingCount') # int, total ratings cast
70
+         result['review_count'] = r.get('reviewCount') # int, written reviews only
71
+
72
+     # --- OG / meta tags (fast fallback, redundant) ---
73
+     og = lambda prop: next(iter(re.findall(
74
+         rf'<meta[^>]+property="og:{prop}"[^>]+content="([^"]*)"', html)), None)
75
+     result['og_title'] = og('title') # includes year: "The Godfather (1972)"
76
+     result['synopsis'] = htmllib.unescape(og('description') or '')
77
+     result['og_image'] = og('image') # large 1200x675 crop
78
+
79
+     # --- Film ID (internal numeric ID) ---
80
+     m = re.search(r'data-film-id="(\d+)"', html)
81
+     result['film_id'] = m.group(1) if m else None
82
+
83
+     # --- Tagline ---
84
+     m = re.search(r'<h4 class="tagline">([^<]+)</h4>', html)
85
+     result['tagline'] = htmllib.unescape(m.group(1)) if m else None
86
+
87
+     # --- Themes (from tab-genres section) ---
88
+     m = re.search(r'<h3><span>Themes</span></h3>.*?<p>(.*?)</p>', html, re.DOTALL)
89
+     result['themes'] = re.findall(r'class="text-slug">([^<]+)</a>', m.group(1)) if m else []
90
+
91
+     # --- Languages ---
92
+     result['languages'] = re.findall(r'href="/films/language/[^/]+/"[^>]*>([^<]+)</a>', html)
93
+
94
+     # --- Fans count ---
95
+     m = re.search(r'class="accessory"[^>]*>\s*([\d,KkMm]+)\s*fans</a>', html)
96
+     result['fans'] = m.group(1) if m else None # e.g. "133K"
97
+
98
+     # --- Popular reviews (top 12 inline on the page) ---
99
+     result['reviews'] = []
100
+     for vid, person, block in re.findall(
101
+         r'<article class="production-viewing[^"]*"[^>]*data-viewing-id="(\d+)"[^>]*data-person="([^"]+)">(.*?)</article>',
102
+         html, re.DOTALL
103
+     ):
104
+         dm = re.search(r'<strong class="displayname">([^<]+)</strong>', block)
105
+         tm = re.search(r'class="body-text -prose -reset[^"]*"[^>]*>(.*?)</div>', block, re.DOTALL)
106
+         lm = re.search(r'data-count="(\d+)"', block)
107
+         result['reviews'].append({
108
+             'viewing_id': vid,
109
+             'username': person,
110
+             'display_name': dm.group(1) if dm else person,
111
+             'review': re.sub(r'<[^>]+>', '', tm.group(1)).strip() if tm else '',
112
+             'likes': int(lm.group(1)) if lm else 0,
113
+         })
114
+
115
+     return result
116
+ ```
117
+
118
+ ### Verified output (2026-04-18)
119
+
120
+ ```python
121
+ data = extract_film_data('the-godfather')
122
+ # {
123
+ # 'title': 'The Godfather',
124
+ # 'year': '1972',
125
+ # 'directors': ['Francis Ford Coppola'],
126
+ # 'genres': ['Crime', 'Drama'],
127
+ # 'countries': ['USA'],
128
+ # 'studios': ['Paramount Pictures', 'Alfran Productions'],
129
+ # 'actors': ['Marlon Brando', 'Al Pacino', 'James Caan', ...], # full cast list
130
+ # 'rating': 4.52,
131
+ # 'rating_count': 2619662,
132
+ # 'review_count': 372579,
133
+ # 'fans': '133K',
134
+ # 'film_id': '51818',
135
+ # 'tagline': "An offer you can't refuse.",
136
+ # 'genres': ['Crime', 'Drama'],
137
+ # 'themes': ['Crime, drugs and gangsters', 'Gritty crime and ruthless gangsters', ...],
138
+ # 'languages': ['English', 'Latin', 'English', 'Italian'], # may have dupes; deduplicate
139
+ # 'og_title': 'The Godfather (1972)',
140
+ # 'synopsis': 'Spanning the years 1945 to 1955...',
141
+ # 'poster_url': 'https://a.ltrbxd.com/resized/film-poster/.../51818-the-godfather-0-230-0-345-crop.jpg...',
142
+ # 'og_image': 'https://a.ltrbxd.com/resized/sm/upload/.../the-godfather-1200-1200-675-675-crop-000000.jpg...',
143
+ # 'reviews': [
144
+ # {'username': 'wizardchurch', 'display_name': 'Hannah', 'likes': 30944,
145
+ # 'review': 'haha they made that scene from zootopia into a movie'},
146
+ # ... # 12 total
147
+ # ]
148
+ # }
149
+
150
+ data = extract_film_data('parasite-2019')
151
+ # title: 'Parasite', year: '2019', rating: 4.53, rating_count: 5264520, review_count: 690652
152
+ # fans: '175K', directors: ['Bong Joon Ho'], countries: ['South Korea']
153
+
154
+ data = extract_film_data('inception')
155
+ # title: 'Inception', year: '2010', rating: 4.23, rating_count: 3913620
156
+ ```
157
+
158
+ ---
159
+
160
+ ## Path 2: User profile via http_get
161
+
162
+ Only the user root page `letterboxd.com/{username}/` is accessible. Sub-pages (`/films/`, `/diary/`, `/lists/`) return 403.
163
+
164
+ ```python
165
+ import re, html as htmllib
166
+ from helpers import http_get
167
+
168
+ def extract_user_profile(username):
169
+     html = http_get(f"https://letterboxd.com/{username}/")
170
+
171
+     # Display name
172
+     dm = re.search(r'class="displayname tooltip"[^>]*><span class="label">([^<]+)</span>', html)
173
+
174
+     # Stats block (Films / This year / Lists / Following / Followers)
175
+     stats = re.findall(
176
+         r'<span class="value">(\d[\d,]*)</span>'
177
+         r'<span class="definition[^"]*">([^<]+)</span>',
178
+         html
179
+     )
180
+
181
+     # Favorites from OG description
182
+     od = re.search(r'<meta[^>]+property="og:description"[^>]+content="([^"]*)"', html)
183
+     favorites = []
184
+     if od:
185
+         fm = re.search(r'Favorites:\s*([^.]+)\.', od.group(1))
186
+         if fm:
187
+             favorites = [f.strip() for f in fm.group(1).split(',')]
188
+
189
+     # Film IDs of films shown on profile page (recent activity)
190
+     film_ids_on_page = list(set(re.findall(r'data-film-id="(\d+)"', html)))
191
+
192
+     return {
193
+         'username': username,
194
+         'display_name': dm.group(1) if dm else None,
195
+         'stats': {label.strip(): int(val.replace(',', '')) for val, label in stats},
196
+         'favorites': favorites,
197
+         'film_ids_on_page': film_ids_on_page,
198
+     }
199
+ ```
200
+
201
+ ### Verified output
202
+
203
+ ```python
204
+ data = extract_user_profile('dave')
205
+ # {
206
+ # 'username': 'dave',
207
+ # 'display_name': 'Dave Vis',
208
+ # 'stats': {'Films': 2553, 'This year': 63, 'Lists': 155, 'Following': 77, 'Followers': 34512},
209
+ # 'favorites': ['High and Low (1963)', 'Burning (2018)', 'My Neighbor Totoro (1988)', 'Mulholland Drive (2001)'],
210
+ # 'film_ids_on_page': ['51818', '47756', ...] # ~32 film IDs from recent activity blocks
211
+ # }
212
+ ```
213
+
214
+ ---
215
+
216
+ ## Path 3: Global activity stream from /films/
217
+
218
+ `letterboxd.com/films/` returns the recent global activity feed — approximately 6 full viewing entries, plus many more film slugs from the UI. Use this to discover recently-logged films.
219
+
220
+ ```python
221
+ import re, html as htmllib
222
+ from helpers import http_get
223
+
224
+ def extract_activity_stream():
225
+     html = http_get("https://letterboxd.com/films/")
226
+     entries = []
227
+     for owner, obj_id, block in re.findall(
228
+         r'class="production-viewing[^"]*"[^>]*data-owner="([^"]+)"[^>]*data-object-id="([^"]+)"[^>]*>(.*?)</article>',
229
+         html, re.DOTALL
230
+     ):
231
+         film_m = re.search(
232
+             r'data-item-name="([^"]*)".*?data-item-slug="([^"]*)".*?data-film-id="(\d+)"',
233
+             block, re.DOTALL
234
+         )
235
+         if film_m:
236
+             entries.append({
237
+                 'owner': owner,
238
+                 'film_name': htmllib.unescape(film_m.group(1)),
239
+                 'film_slug': film_m.group(2),
240
+                 'film_id': film_m.group(3),
241
+             })
242
+     return entries
243
+
244
+ # Returns ~6 entries. Film names are in "Title (Year)" format.
245
+ # Example: [{'owner': 'sidduww', 'film_name': 'The Drama (2026)',
246
+ # 'film_slug': 'the-drama', 'film_id': '1205494'}, ...]
247
+ ```
248
+
249
+ ---
250
+
251
+ ## Path 4: Browser for list pages and sub-pages (403 via http_get)
252
+
253
+ These pages require the browser — use `goto_url()` + `wait_for_load()` + `wait(2)`:
254
+
255
+ ```python
256
+ from helpers import goto_url, wait_for_load, wait, js
257
+ import json
258
+
259
+ # Popular films
260
+ goto_url("https://letterboxd.com/films/popular/")
261
+ wait_for_load()
262
+ wait(2)
263
+
264
+ films = json.loads(js("""
265
+ (function() {
266
+ var items = Array.from(document.querySelectorAll('li.film-list-entry, li[class*="poster-container"]'));
267
+ return JSON.stringify(items.slice(0, 30).map(function(el) {
268
+ var poster = el.querySelector('[data-item-slug]') || el.querySelector('[data-film-slug]');
269
+ return {
270
+ name: poster ? (poster.dataset.itemName || poster.dataset.filmName) : null,
271
+ slug: poster ? (poster.dataset.itemSlug || poster.dataset.filmSlug) : null,
272
+ film_id: poster ? poster.dataset.filmId : null
273
+ };
274
+ }).filter(function(x){ return x.slug; }));
275
+ })()
276
+ """))
277
+
278
+ # User watched films list (paginated, 72/page)
279
+ goto_url("https://letterboxd.com/dave/films/")
280
+ wait_for_load()
281
+ wait(2)
282
+
283
+ films = json.loads(js("""
284
+ (function() {
285
+ var items = Array.from(document.querySelectorAll('li[data-film-id]'));
286
+ return JSON.stringify(items.map(function(el) {
287
+ return {
288
+ film_id: el.dataset.filmId,
289
+ film_slug: el.dataset.targetLink ? el.dataset.targetLink.replace(/\\/film\\/|\\/$/g,'') : null,
290
+ rating: el.dataset.ownerRating || null
291
+ };
292
+ }));
293
+ })()
294
+ """))
295
+
296
+ # User diary entries
297
+ goto_url("https://letterboxd.com/dave/diary/")
298
+ wait_for_load()
299
+ wait(2)
300
+
301
+ # For paginated browsing, check next page link
302
+ next_page_url = js("""
303
+ (function() {
304
+ var a = document.querySelector('a.next');
305
+ return a ? a.href : null;
306
+ })()
307
+ """)
308
+ # Returns URL for next page or null. Load it with goto_url(next_page_url).
309
+ ```
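The next-page check above can be turned into a loop. The following is a minimal sketch, not part of the original helpers: `collect_pages` is a hypothetical name, and the browser functions are passed in as parameters (assumed to behave like `goto_url`, `wait_for_load`, `wait`, and `js` above) so the loop can be exercised without a live browser. `extract_js` is assumed to return a JSON array string, and `next_js` is assumed to return the next page URL or `None`.

```python
import json

def collect_pages(start_url, goto_url, wait_for_load, wait, js,
                  extract_js, next_js, max_pages=5):
    """Follow 'next' links from start_url and collect rows from each page.

    Hypothetical helper: the four browser callables are assumed to match
    the helpers used above. max_pages caps the number of pages visited.
    """
    rows, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)                 # guard against a link looping back
        goto_url(url)
        wait_for_load()
        wait(2)
        rows.extend(json.loads(js(extract_js)))
        url = js(next_js)             # None (JS null) ends the loop
    return rows
```

In a real session the call would look like `collect_pages("https://letterboxd.com/dave/films/", goto_url, wait_for_load, wait, js, extract_js, next_js)`, with the two JavaScript strings taken from the snippets above.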
310
+
311
+ ---
312
+
313
+ ## Gotchas
314
+
315
+ **JSON-LD is wrapped in CDATA comments** — `json.loads(block)` will fail without stripping the wrapper. Always strip `/* <![CDATA[ */` and `/* ]]> */` first:
316
+ ```python
317
+ cleaned = re.sub(r'/\*\s*<!\[CDATA\[.*?\*/\s*', '', block, flags=re.DOTALL)
318
+ cleaned = re.sub(r'/\*\s*\]\]>.*?\*/', '', cleaned, flags=re.DOTALL)
319
+ data = json.loads(cleaned.strip())
320
+ ```
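To check the two substitutions offline, they can be wrapped in a small function and run on a made-up string. This is a sketch: `parse_jsonld_block` is a hypothetical name, and `sample` is an invented input shaped like the wrapper described above, not real page data.

```python
import json
import re

def parse_jsonld_block(block):
    """Strip the /* <![CDATA[ */ ... /* ]]> */ wrapper, then parse the JSON."""
    cleaned = re.sub(r'/\*\s*<!\[CDATA\[.*?\*/\s*', '', block, flags=re.DOTALL)
    cleaned = re.sub(r'/\*\s*\]\]>.*?\*/', '', cleaned, flags=re.DOTALL)
    return json.loads(cleaned.strip())

# Invented sample input (assumption), mimicking the CDATA comment wrapper.
sample = '\n/* <![CDATA[ */\n{"@type": "Movie", "name": "Parasite"}\n/* ]]> */\n'
data = parse_jsonld_block(sample)
```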
321
+
322
+ **JSON-LD `name` is bare title, not "Title (Year)"** — `data['name']` returns `'Parasite'`, not `'Parasite (2019)'`. Year is in `data['releasedEvent'][0]['startDate']`. The OG `og:title` meta tag does include the year.
323
+
324
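+ **OG description contains HTML entities** — `og:description` and `tagline` use `&#039;` etc. Always unescape them with the standard library's `html.unescape()`; the code above imports the module as `htmllib` because `html` is used as the page-source variable, so call `htmllib.unescape()` there.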
325
+
326
+ **`languages` list can have duplicates** — e.g. Parasite returns `['Korean', 'English', 'German', 'Korean']`. Call `list(dict.fromkeys(result['languages']))` to deduplicate while preserving order.
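A short illustration of the deduplication, using the Parasite list quoted above as input:

```python
languages = ['Korean', 'English', 'German', 'Korean']

# dict keys keep insertion order (Python 3.7+), so the first occurrence wins.
unique_languages = list(dict.fromkeys(languages))
print(unique_languages)  # ['Korean', 'English', 'German']
```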
327
+
328
+ **Disambiguation slugs** — when two films share a title, Letterboxd appends the year to the slug: `parasite-2019` (Bong's film), vs `parasite` (1982 film). If your slug 404s, try appending `-{year}`.
329
+
330
+ **403 pages** — `/film/{slug}/reviews/`, `/film/{slug}/ratings/`, `/film/{slug}/cast/`, `/film/{slug}/details/`, `/{username}/films/`, `/films/popular/`, `/films/by/rating/`, `/genre/{slug}/`, `/director/{slug}/`, `/actor/{slug}/` all return 403 to `http_get`. These require the browser.
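One way to route requests is to check the path against the list above before choosing between `http_get` and the browser. This is a rough sketch: `needs_browser` and `BROWSER_ONLY` are hypothetical names, and the patterns cover only the paths named in this document (plus the `/diary/` and `/lists/` user sub-pages from Path 2), so other 403 pages may exist.

```python
import re

# Patterns derived from the 403 list in this document (assumed incomplete).
BROWSER_ONLY = [
    r'^/film/[^/]+/(reviews|ratings|cast|details)/',
    r'^/(?!film/)[^/]+/(films|diary|lists)/',   # user sub-pages, not /film/{slug}/
    r'^/films/(popular|by/rating)/',
    r'^/(genre|director|actor)/[^/]+/',
]

def needs_browser(path):
    """Return True if the URL path is expected to return 403 to http_get."""
    return any(re.match(pattern, path) for pattern in BROWSER_ONLY)
```

For example, `needs_browser('/film/parasite-2019/reviews/')` is true, while `needs_browser('/film/parasite-2019/')` is false.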
331
+
332
+ **CSI endpoints are 403** — Letterboxd loads the ratings histogram via `/csi/film/{slug}/rating-histogram/` which returns 403 without a session cookie. Access ratings distribution via browser on `/film/{slug}/ratings/`.
333
+
334
+ **`/csi/` and `/ajax/` endpoints need session cookies** — these are used to populate the ratings histogram, friend activity, and popular review sections after page load. Only the inline HTML data (top 12 popular reviews) is available via `http_get`.
335
+
336
+ **Cloudflare Turnstile is present but passive** — the `configuration.cloudflare.turnstile` object is in the page JS, but it only activates on the login form. It does not block unauthenticated reads on public film/user pages.
337
+
338
+ **The official API requires OAuth** — `api.letterboxd.com/api/v0/` returns 401 on all endpoints. Apply for API access at letterboxd.com/api-beta/ to get client credentials.
339
+
340
+ **Fans count is abbreviated** — `'133K'`, `'175K'`. Parse with:
341
+ ```python
342
+ def parse_abbrev(s):
343
+     s = s.strip().upper()
344
+     if s.endswith('K'): return round(float(s[:-1]) * 1000)     # round() avoids float truncation
345
+     if s.endswith('M'): return round(float(s[:-1]) * 1000000)
346
+     return int(s.replace(',', ''))
347
+ ```
348
+
349
+ **Film slug from unknown title** — Letterboxd has no public search API. Construct the slug by lowercasing the title and replacing spaces with hyphens, then `http_get` and check for a 403/404 vs a valid JSON-LD block.
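The slug construction can be sketched as below. `slug_candidates` is a hypothetical helper, and dropping punctuation is an assumption that matches the `2001-a-space-odyssey` example above; Letterboxd's exact slug rules are not documented here, and non-ASCII letters are simply removed by this version.

```python
import re

def slug_candidates(title, year=None):
    """Guess Letterboxd slugs for a title (hypothetical helper).

    Returns the plain slug first, then the '-{year}' disambiguation form
    if a year is given.
    """
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)         # drop punctuation (assumed rule)
    slug = re.sub(r"[\s-]+", "-", slug).strip("-")   # whitespace runs -> one hyphen
    candidates = [slug]
    if year is not None:
        candidates.append(f"{slug}-{year}")
    return candidates
```

Each candidate can then be fetched with `http_get` and accepted only if the page contains a `Movie` JSON-LD block.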