@pencil-agent/nano-pencil 2.0.1 → 2.0.2

Files changed (186)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/grub/README.md +112 -112
  129. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  130. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  131. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  132. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  133. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  134. package/dist/extensions/builtin/loop/README.md +92 -92
  135. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  136. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  137. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  138. package/dist/extensions/builtin/sal/README.md +72 -72
  139. package/dist/extensions/builtin/security-audit/README.md +289 -289
  140. package/dist/extensions/builtin/team/AGENT.md +112 -112
  141. package/dist/extensions/builtin/team/TESTING.md +299 -299
  142. package/dist/extensions/builtin/token-save/README.md +56 -56
  143. package/dist/extensions/optional/AGENT.md +10 -10
  144. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  145. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  146. package/dist/modes/interactive/interactive-mode.js +19 -19
  147. package/dist/modes/interactive/theme/dark.json +85 -85
  148. package/dist/modes/interactive/theme/light.json +84 -84
  149. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  150. package/dist/modes/interactive/theme/warm.json +81 -81
  151. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  152. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  153. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  154. package/docs/SDK-TESTING.md +364 -0
  155. package/docs/codex-goal-command-impl.md +1055 -1055
  156. package/docs/codex-goal-vs-grub.md +500 -500
  157. package/docs/custom-provider.md +27 -27
  158. package/docs/extensions.md +27 -27
  159. package/docs/keybindings.md +27 -27
  160. package/docs/loop 重构完成总结.md +250 -250
  161. package/docs/loop 重构完成报告.md +122 -122
  162. package/docs/loop 重构方案.md +1222 -1222
  163. package/docs/loop 重构方案实现报告.md +158 -158
  164. package/docs/loop 重构方案对比分析.md +128 -128
  165. package/docs/loop 重构计划.md +320 -320
  166. package/docs/loop-usage-examples.md +214 -214
  167. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  168. package/docs/models.md +27 -27
  169. package/docs/packages.md +27 -27
  170. package/docs/pi-design-philosophy.md +457 -457
  171. package/docs/planmode.md +1987 -1987
  172. package/docs/prompt-templates.md +27 -27
  173. package/docs/providers.md +27 -27
  174. package/docs/sdk.md +27 -27
  175. package/docs/skills.md +27 -27
  176. package/docs/startup-performance-optimization.md +301 -0
  177. package/docs/themes.md +27 -27
  178. package/docs/tui.md +27 -27
  179. package/docs/认知地图.md +47 -0
  180. package/package.json +190 -190
  181. package/docs/cc-agent-design.md +0 -1297
  182. package/docs/cc-tui-design.md +0 -1333
  183. package/docs/nanoPencil-学习计划.md +0 -170
  184. package/docs/scan-report.md +0 -3820
  185. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  186. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,124 +1,124 @@
- # Reddit — Scraping & Post Extraction
-
- Reddit's "new" web UI (`reddit.com`) is a Lit / web-components SPA built around custom elements (`shreddit-post`, `shreddit-comment`, `faceplate-*`). This makes DOM extraction unusually reliable — the custom element tags are stable and exposed on the element itself (no hashed class names).
-
- Use the browser when you're logged in (private subreddits, NSFW gates, rate-limit avoidance). For fully public content, the JSON API path below is faster.
-
- ## URL patterns
-
- - Full post: `https://www.reddit.com/r/<sub>/comments/<id>/<slug>/`
- - Share short-link: `https://www.reddit.com/r/<sub>/s/<hash>` — redirects to the full URL once the page loads. `new_tab(short_url)` + `wait_for_load()` is enough; by the time you read `location.href` it will be the canonical one.
- - Old Reddit: append `/.json` to any post URL for anonymous JSON: `https://www.reddit.com/r/<sub>/comments/<id>/.json`.
- - Old UI (simpler DOM, no web components): `https://old.reddit.com/r/<sub>/comments/<id>/` — useful fallback when `shreddit-*` selectors change.
-
- ## Path 1: JSON API (fastest for public posts)
-
- ```python
- from helpers import http_get
- import json
-
- url = "https://www.reddit.com/r/cursor/comments/1l0u9y7/claude_code_prompt_to_autogenerate_full_cursor/.json"
- data = json.loads(http_get(url, headers={"User-Agent": "Mozilla/5.0"}))
- post = data[0]["data"]["children"][0]["data"]
- # post fields: title, selftext, author, score, num_comments, created_utc, url, permalink
- comments = data[1]["data"]["children"] # list of { kind: "t1", data: {...} } or { kind: "more" }
- ```
-
- Fails on:
-
- - Private / quarantined subreddits (401)
- - NSFW posts without an authenticated session
- - Anti-scraping 429s under load — back off or switch to the browser path
-
- ## Path 2: Browser DOM extraction (logged-in)
-
- Core selector: every post renders inside a single `<shreddit-post>` custom element. Top-level comments are `<shreddit-comment depth="0">`.
-
- ```bash
- browser-harness <<'PY'
- new_tab("https://www.reddit.com/r/vibecoding/comments/1kwuqpz/")
- wait_for_load()
- wait(3.0) # SPA still hydrating after readyState=complete
-
- # Scroll to force comment tree lazy-load (twice, ~2000px each)
- scroll(500, 500, dy=2000); wait(1.0)
- scroll(500, 500, dy=2000); wait(1.0)
-
- data = js(r"""
- (()=>{
- const postEl = document.querySelector('shreddit-post');
- if(!postEl) return null;
- const title = (postEl.querySelector('h1, [slot="title"]')||{}).innerText?.trim() || '';
- const bodyEl = postEl.querySelector('[slot="text-body"] .md, [slot="text-body"]');
- const body = bodyEl ? bodyEl.innerText.trim() : '';
- const author = (postEl.querySelector('[slot="authorName"] a, a[data-testid="post_author_link"]')||{}).innerText?.trim() || '';
- const subM = location.pathname.match(/^\/r\/([^\/]+)/);
- const subreddit = subM ? subM[1] : '';
- const scoreEl = postEl.querySelector('faceplate-number');
- const score = scoreEl ? scoreEl.getAttribute('number') || scoreEl.innerText : '';
- const comments = [];
- for(const c of document.querySelectorAll('shreddit-comment[depth="0"]')){
- const cBodyEl = c.querySelector('[slot="comment"] .md, [slot="comment"]');
- const cBody = cBodyEl ? cBodyEl.innerText.trim() : '';
- if(!cBody) continue;
- comments.push({
- author: c.getAttribute('author') || '',
- score: c.getAttribute('score') || '',
- body: cBody
- });
- if(comments.length >= 10) break;
- }
- return {subreddit, title, author, score, body, comments, url: location.href};
- })()
- """)
- print(data["title"], "·", len(data["body"]), "chars ·", len(data["comments"]), "comments")
- PY
- ```
-
- ### Key selectors
-
- | Target | Selector | Notes |
- | ---------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
- | Post container | `shreddit-post` | One per post page. Attributes include `post-title`, `post-id`, `subreddit-name`, `author`. |
- | Post title | `shreddit-post h1` or `[slot="title"]` | H1 is also the page title. |
- | Post text body | `shreddit-post [slot="text-body"] .md` | `.md` is the rendered markdown container. For link posts this selector returns null (there is no body). |
- | Post author name | `[slot="authorName"] a` | Plain text. |
- | Vote score | `shreddit-post faceplate-number` | Read the `number` attribute (digit string) — `innerText` is abbreviated ("1.2k"). |
- | Top-level comment | `shreddit-comment[depth="0"]` | Depth is an attribute — `depth="1"` is a reply, etc. |
- | Comment body | `shreddit-comment [slot="comment"] .md` | Same pattern as post body. |
- | Comment author / score | `shreddit-comment` attributes: `author`, `score`, `created-timestamp` | Use `getAttribute`, not DOM descendants. |
-
- ### Share links
-
- `/s/<hash>` URLs redirect before the SPA mounts. You don't need to resolve them manually — just `new_tab(url)` + `wait_for_load()` + `wait(2)`, then read `location.href` for the canonical path.
-
- ### Comment tree lazy-loading
-
- New Reddit renders only the initial visible comments. To get more, **scroll twice**. `ensureReplies` / `more` placeholders exist but clicking them is brittle; scroll is the most reliable trigger. For a deep thread, loop `scroll + wait` until `shreddit-comment` count stabilizes between passes.
-
- ### Login / gate detection
-
- ```python
- state = js("""
- (()=>{
- const loginWall = !!document.querySelector('a[href*="/login"], [data-testid="login-button"]');
- const ageGate = !!document.querySelector('[data-testid="nsfw-gate"], shreddit-interstitial');
- return {loginWall, ageGate};
- })()
- """)
- ```
-
- If `ageGate` is true and you are logged in but haven't opted into NSFW content, the gate blocks extraction — toggle NSFW in account settings, not programmatically.
-
- ## Gotchas
-
- - **`faceplate-number.innerText` is abbreviated** ("1.2k", "16.6k"). Always prefer `getAttribute('number')` for the exact digit count.
- - **`shreddit-comment` is a custom element, not a `<div>`.** CSS descendant selectors still work, but older jQuery-style parent traversals may not — stick to standard DOM.
- - **`depth="0"` is a string attribute.** `[depth="0"]` in a CSS selector works; `depth=0` (no quotes) also works in the newer parser, but the quoted form is safest.
- - **Collapsed comments render with body still in the DOM, but behind `expando-button`.** The `.md` selector still grabs the text — you don't need to expand.
- - **Post body can be empty.** For link posts or image posts, `[slot="text-body"]` doesn't exist; null-check before reading `.innerText`.
- - **`wait_for_load()` is not enough.** Reddit sometimes paints the post skeleton before the content hydrates. Add `wait(2.0)`–`wait(3.0)` after `wait_for_load()` or retry reads on null `shreddit-post`.
- - **Share URLs (`/s/<hash>`) can't be deep-linked into a comment.** They always land at the post top. If the original raindrop captured `/s/...`, the in-DOM permalink (read from `location.href` after load) is the canonical URL worth storing.
- - **Old Reddit (`old.reddit.com`) is a separate DOM** — no `shreddit-*` elements, no `faceplate-*`. If your login session was established on new Reddit, `old.reddit.com` will still honor the cookie.
- - **For NSFW or quarantined subs**, the browser path requires your account to have opted in. The JSON API requires OAuth with appropriate scope.
- - **`[slot="text-body"] .md .md`** — Reddit occasionally double-wraps; the selector `[slot="text-body"] .md` is the outer one and is what you want. Using `[slot="text-body"]` alone works too, but may include meta text.
+ # Reddit — Scraping & Post Extraction
+
+ Reddit's "new" web UI (`reddit.com`) is a Lit / web-components SPA built around custom elements (`shreddit-post`, `shreddit-comment`, `faceplate-*`). This makes DOM extraction unusually reliable — the custom element tags are stable and exposed on the element itself (no hashed class names).
+
+ Use the browser when you're logged in (private subreddits, NSFW gates, rate-limit avoidance). For fully public content, the JSON API path below is faster.
+
+ ## URL patterns
+
+ - Full post: `https://www.reddit.com/r/<sub>/comments/<id>/<slug>/`
+ - Share short-link: `https://www.reddit.com/r/<sub>/s/<hash>` — redirects to the full URL once the page loads. `new_tab(short_url)` + `wait_for_load()` is enough; by the time you read `location.href` it will be the canonical one.
+ - Old Reddit: append `/.json` to any post URL for anonymous JSON: `https://www.reddit.com/r/<sub>/comments/<id>/.json`.
+ - Old UI (simpler DOM, no web components): `https://old.reddit.com/r/<sub>/comments/<id>/` — useful fallback when `shreddit-*` selectors change.
+
+ ## Path 1: JSON API (fastest for public posts)
+
+ ```python
+ from helpers import http_get
+ import json
+
+ url = "https://www.reddit.com/r/cursor/comments/1l0u9y7/claude_code_prompt_to_autogenerate_full_cursor/.json"
+ data = json.loads(http_get(url, headers={"User-Agent": "Mozilla/5.0"}))
+ post = data[0]["data"]["children"][0]["data"]
+ # post fields: title, selftext, author, score, num_comments, created_utc, url, permalink
+ comments = data[1]["data"]["children"] # list of { kind: "t1", data: {...} } or { kind: "more" }
+ ```
+
+ Fails on:
+
+ - Private / quarantined subreddits (401)
+ - NSFW posts without an authenticated session
+ - Anti-scraping 429s under load — back off or switch to the browser path
+
+ ## Path 2: Browser DOM extraction (logged-in)
+
+ Core selector: every post renders inside a single `<shreddit-post>` custom element. Top-level comments are `<shreddit-comment depth="0">`.
+
+ ```bash
+ browser-harness <<'PY'
+ new_tab("https://www.reddit.com/r/vibecoding/comments/1kwuqpz/")
+ wait_for_load()
+ wait(3.0) # SPA still hydrating after readyState=complete
+
+ # Scroll to force comment tree lazy-load (twice, ~2000px each)
+ scroll(500, 500, dy=2000); wait(1.0)
+ scroll(500, 500, dy=2000); wait(1.0)
+
+ data = js(r"""
+ (()=>{
+ const postEl = document.querySelector('shreddit-post');
+ if(!postEl) return null;
+ const title = (postEl.querySelector('h1, [slot="title"]')||{}).innerText?.trim() || '';
+ const bodyEl = postEl.querySelector('[slot="text-body"] .md, [slot="text-body"]');
+ const body = bodyEl ? bodyEl.innerText.trim() : '';
+ const author = (postEl.querySelector('[slot="authorName"] a, a[data-testid="post_author_link"]')||{}).innerText?.trim() || '';
+ const subM = location.pathname.match(/^\/r\/([^\/]+)/);
+ const subreddit = subM ? subM[1] : '';
+ const scoreEl = postEl.querySelector('faceplate-number');
+ const score = scoreEl ? scoreEl.getAttribute('number') || scoreEl.innerText : '';
+ const comments = [];
+ for(const c of document.querySelectorAll('shreddit-comment[depth="0"]')){
+ const cBodyEl = c.querySelector('[slot="comment"] .md, [slot="comment"]');
+ const cBody = cBodyEl ? cBodyEl.innerText.trim() : '';
+ if(!cBody) continue;
+ comments.push({
+ author: c.getAttribute('author') || '',
+ score: c.getAttribute('score') || '',
+ body: cBody
+ });
+ if(comments.length >= 10) break;
+ }
+ return {subreddit, title, author, score, body, comments, url: location.href};
+ })()
+ """)
+ print(data["title"], "·", len(data["body"]), "chars ·", len(data["comments"]), "comments")
+ PY
+ ```
+
+ ### Key selectors
+
+ | Target | Selector | Notes |
+ | ---------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
+ | Post container | `shreddit-post` | One per post page. Attributes include `post-title`, `post-id`, `subreddit-name`, `author`. |
+ | Post title | `shreddit-post h1` or `[slot="title"]` | H1 is also the page title. |
+ | Post text body | `shreddit-post [slot="text-body"] .md` | `.md` is the rendered markdown container. For link posts this selector returns null (there is no body). |
+ | Post author name | `[slot="authorName"] a` | Plain text. |
+ | Vote score | `shreddit-post faceplate-number` | Read the `number` attribute (digit string) — `innerText` is abbreviated ("1.2k"). |
+ | Top-level comment | `shreddit-comment[depth="0"]` | Depth is an attribute — `depth="1"` is a reply, etc. |
+ | Comment body | `shreddit-comment [slot="comment"] .md` | Same pattern as post body. |
+ | Comment author / score | `shreddit-comment` attributes: `author`, `score`, `created-timestamp` | Use `getAttribute`, not DOM descendants. |
+
+ ### Share links
+
+ `/s/<hash>` URLs redirect before the SPA mounts. You don't need to resolve them manually — just `new_tab(url)` + `wait_for_load()` + `wait(2)`, then read `location.href` for the canonical path.
+
+ ### Comment tree lazy-loading
+
+ New Reddit renders only the initial visible comments. To get more, **scroll twice**. `ensureReplies` / `more` placeholders exist but clicking them is brittle; scroll is the most reliable trigger. For a deep thread, loop `scroll + wait` until `shreddit-comment` count stabilizes between passes.
+
+ ### Login / gate detection
+
+ ```python
+ state = js("""
+ (()=>{
+ const loginWall = !!document.querySelector('a[href*="/login"], [data-testid="login-button"]');
+ const ageGate = !!document.querySelector('[data-testid="nsfw-gate"], shreddit-interstitial');
+ return {loginWall, ageGate};
+ })()
+ """)
+ ```
+
+ If `ageGate` is true and you are logged in but haven't opted into NSFW content, the gate blocks extraction — toggle NSFW in account settings, not programmatically.
+
+ ## Gotchas
+
+ - **`faceplate-number.innerText` is abbreviated** ("1.2k", "16.6k"). Always prefer `getAttribute('number')` for the exact digit count.
+ - **`shreddit-comment` is a custom element, not a `<div>`.** CSS descendant selectors still work, but older jQuery-style parent traversals may not — stick to standard DOM.
+ - **`depth="0"` is a string attribute.** `[depth="0"]` in a CSS selector works; `depth=0` (no quotes) also works in the newer parser, but the quoted form is safest.
+ - **Collapsed comments render with body still in the DOM, but behind `expando-button`.** The `.md` selector still grabs the text — you don't need to expand.
+ - **Post body can be empty.** For link posts or image posts, `[slot="text-body"]` doesn't exist; null-check before reading `.innerText`.
+ - **`wait_for_load()` is not enough.** Reddit sometimes paints the post skeleton before the content hydrates. Add `wait(2.0)`–`wait(3.0)` after `wait_for_load()` or retry reads on null `shreddit-post`.
+ - **Share URLs (`/s/<hash>`) can't be deep-linked into a comment.** They always land at the post top. If the original raindrop captured `/s/...`, the in-DOM permalink (read from `location.href` after load) is the canonical URL worth storing.
+ - **Old Reddit (`old.reddit.com`) is a separate DOM** — no `shreddit-*` elements, no `faceplate-*`. If your login session was established on new Reddit, `old.reddit.com` will still honor the cookie.
+ - **For NSFW or quarantined subs**, the browser path requires your account to have opted in. The JSON API requires OAuth with appropriate scope.
+ - **`[slot="text-body"] .md .md`** — Reddit occasionally double-wraps; the selector `[slot="text-body"] .md` is the outer one and is what you want. Using `[slot="text-body"]` alone works too, but may include meta text.
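
Reviewer note on the hunk above: the "Comment tree lazy-loading" section advises looping `scroll + wait` until the `shreddit-comment` count stabilizes, but the file shows no loop. The following is only a minimal sketch of that idea, not code from the package. The function name `scroll_until_stable` and its parameters are hypothetical, and the two callables are placeholders for whatever the harness provides; the wiring shown in the comments assumes `js(...)` returns the evaluated number as a Python `int`, which the diff does not confirm.

```python
from typing import Callable


def scroll_until_stable(
    count_comments: Callable[[], int],
    scroll_once: Callable[[], None],
    max_passes: int = 20,
    stable_passes: int = 2,
) -> int:
    """Scroll until the comment count stops growing; return the last count seen.

    Stops early once the count is unchanged for `stable_passes` consecutive
    passes, and always stops after `max_passes` scrolls.
    """
    last = count_comments()
    unchanged = 0
    for _ in range(max_passes):
        scroll_once()
        current = count_comments()
        if current == last:
            unchanged += 1
            if unchanged >= stable_passes:
                break
        else:
            # Count changed: reset the stability counter and remember the new value.
            unchanged = 0
            last = current
    return last


# Hypothetical wiring inside a browser-harness session (assumption, untested):
#   count = lambda: js("document.querySelectorAll('shreddit-comment').length")
#   step = lambda: (scroll(500, 500, dy=2000), wait(1.0))
#   total = scroll_until_stable(count, step)
```

Passing the counter and the scroll step in as callables keeps the loop independent of the harness, so it can be exercised with fakes before being pointed at a live page. The `max_passes` cap is there so that an endlessly growing thread cannot keep the loop running.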