@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,243 +1,243 @@
1
- # Hacker News — Data Extraction
2
-
3
- `https://news.ycombinator.com` — YCombinator's link aggregator. Three access paths tested: `http_get` DOM scraping, Algolia search API, and the official HN Firebase API. All work without a browser.
4
-
5
- ## Do this first: pick your access path
6
-
7
- | Goal | Best approach | Latency |
8
- |------|--------------|---------|
9
- | Current front page (30 stories, real-time) | `http_get` + regex | ~170ms |
10
- | Historical / keyword search | Algolia search API | ~400ms |
11
- | Full comment tree (nested) | Algolia items API | ~300ms |
12
- | Specific item by ID | Firebase API | ~200ms |
13
- | 500 ranked story IDs | Firebase topstories | ~200ms (+ ~190ms/item after) |
14
-
15
- **Never use a browser for read-only HN tasks.** Everything is accessible over HTTP with no auth, no JS rendering needed.
16
-
17
- ---
18
-
19
- ## Path 1: http_get front page (fastest for real-time data)
20
-
21
- The front page HTML is ~34KB. Story order matches Firebase `/topstories.json` exactly — confirmed identical on 2026-04-18.
22
-
23
- ```python
24
- import re, html as htmllib
25
-
26
- page = http_get("https://news.ycombinator.com")
27
-
28
- # Extract all 30 story IDs (in rank order)
29
- story_ids = re.findall(r'<tr class="athing submission" id="(\d+)">', page)
30
-
31
- # Extract titles + URLs (same order as IDs)
32
- titles_urls = re.findall(
33
- r'class="titleline"[^>]*><a href="([^"]*)"[^>]*>(.*?)</a>', page
34
- )
35
-
36
- # Extract scores keyed by story ID (job posts have no score row)
37
- scores_by_id = {
38
- m.group(1): int(m.group(2))
39
- for m in re.finditer(
40
- r'<span class="score" id="score_(\d+)">(\d+) points</span>', page
41
- )
42
- }
43
-
44
- # Extract authors keyed by story ID (anchor on score span)
45
- authors_by_id = {}
46
- for m in re.finditer(
47
- r'<span class="score" id="score_(\d+)">\d+ points</span>'
48
- r'.*?class="hnuser">(.*?)</a>',
49
- page, re.DOTALL
50
- ):
51
- authors_by_id[m.group(1)] = m.group(2)
52
-
53
- # Extract comment counts keyed by story ID
54
- comments_by_id = {
55
- m.group(1): int(m.group(2))
56
- for m in re.finditer(
57
- r'href="item\?id=(\d+)">(\d+)&nbsp;comments</a>', page
58
- )
59
- }
60
-
61
- stories = []
62
- for i, sid in enumerate(story_ids):
63
- url, raw_title = titles_urls[i] if i < len(titles_urls) else ('', '')
64
- stories.append({
65
- 'rank': i + 1,
66
- 'id': sid,
67
- 'title': htmllib.unescape(raw_title), # MUST unescape — titles contain &#x27; etc.
68
- 'url': url,
69
- 'score': scores_by_id.get(sid), # None for job posts
70
- 'author': authors_by_id.get(sid),
71
- 'comments': comments_by_id.get(sid, 0),
72
- })
73
- ```
74
-
75
- **Gotchas:**
76
- - Titles contain HTML entities (`&#x27;` `&amp;` `&quot;` `&gt;`). Always call `html.unescape()`.
77
- - `<tr class="athing submission" id="...">` — the class is `athing submission`, not just `athing`. The `athing comtr` class is for comment rows.
78
- - Job/hiring posts (YC ads) appear in the list but have no score or author. `scores_by_id.get(sid)` returns `None` for them — check before comparing.
79
- - `re.DOTALL` multi-line patterns can cross story boundaries. Use ID-anchored patterns (as above) instead of positional zip for score/author.
80
- - The page only serves page 1 (30 items). Pages 2–4 exist at `?p=2` etc. but require a login cookie for page 3+.
81
-
82
- ---
83
-
84
- ## Path 2: Algolia search API (best for historical / keyword search)
85
-
86
- No rate limiting observed. Returns up to 1000 hits per query (`hitsPerPage` max is capped at ~1000 per Algolia plan).
87
-
88
- ```python
89
- import json
90
-
91
- # Keyword search — sorted by relevance
92
- data = json.loads(http_get(
93
- "https://hn.algolia.com/api/v1/search"
94
- "?query=llm&tags=story&hitsPerPage=20"
95
- ))
96
-
97
- # Date-sorted (most recent first)
98
- data = json.loads(http_get(
99
- "https://hn.algolia.com/api/v1/search_by_date"
100
- "?tags=story&hitsPerPage=20"
101
- ))
102
-
103
- # Paginate: add &page=N (0-indexed), up to data['nbPages']-1
104
- ```
105
-
106
- **Fields returned per story hit:**
107
- ```
108
- objectID, title, url, author, points, num_comments,
109
- created_at (ISO 8601), created_at_i (unix ts), story_id,
110
- children (list of comment IDs — flat, not tree),
111
- _tags, _highlightResult
112
- ```
113
-
114
- **Fields returned per comment hit:**
115
- ```
116
- objectID, comment_text, author, story_id, story_title, story_url,
117
- parent_id, created_at, created_at_i, points
118
- ```
119
- Note: comment hits use `comment_text`, NOT `text`. Story hits use `story_text` for self-post body.
120
-
121
- ### Tag filters
122
-
123
- Tags are AND by default, OR with parentheses:
124
-
125
- ```python
126
- # Story types
127
- "tags=story" # regular link/self posts
128
- "tags=show_hn" # Show HN
129
- "tags=ask_hn" # Ask HN
130
- "tags=poll" # polls
131
- "tags=job" # job posts
132
-
133
- # Combined AND
134
- "tags=story,front_page" # currently on front page
135
- "tags=story,author_pg" # stories submitted by pg
136
-
137
- # OR
138
- "tags=(ask_hn,show_hn),story" # Ask OR Show HN
139
-
140
- # By story ID (gets story + all its comments)
141
- "tags=story_47806725"
142
- ```
143
-
144
- ### Numeric filters
145
-
146
- ```python
147
- # Date range (unix timestamps)
148
- "numericFilters=created_at_i>1745000000"
149
- "numericFilters=created_at_i>1700000000,created_at_i<1750000000"
150
-
151
- # Point threshold
152
- "numericFilters=points>100"
153
- "numericFilters=points>500,points<1000"
154
- ```
155
-
156
- ### Full Algolia items API (nested comment tree)
157
-
158
- ```python
159
- import json
160
-
161
- thread = json.loads(http_get(
162
- "https://hn.algolia.com/api/v1/items/47806725"
163
- ))
164
- # thread['children'] = list of top-level comment objects
165
- # Each comment: author, text (HTML), created_at, children (nested replies)
166
- # Recursively walk children for full thread
167
-
168
- # Total comment count (recursive walk with stack):
169
- stack = list(thread.get('children', []))
170
- total = 0
171
- while stack:
172
- node = stack.pop()
173
- total += 1
174
- stack.extend(node.get('children', []))
175
- ```
176
-
177
- Confirmed: Algolia items returns 653 total comments for a 659-comment thread (some deleted). `text` field in items API is HTML with `<p>` tags and `<a>` links — may need to strip tags.
178
-
179
- ---
180
-
181
- ## Path 3: Official HN Firebase API
182
-
183
- Clean JSON, no scraping. Use for fetching specific items or building live feeds.
184
-
185
- ```python
186
- import json
187
-
188
- # Ranked story ID lists (no metadata — just IDs)
189
- top = json.loads(http_get("https://hacker-news.firebaseio.com/v0/topstories.json")) # 500 IDs
190
- new = json.loads(http_get("https://hacker-news.firebaseio.com/v0/newstories.json")) # 500 IDs
191
- best = json.loads(http_get("https://hacker-news.firebaseio.com/v0/beststories.json")) # 200 IDs
192
- ask = json.loads(http_get("https://hacker-news.firebaseio.com/v0/askstories.json")) # ~32 IDs
193
- show = json.loads(http_get("https://hacker-news.firebaseio.com/v0/showstories.json")) # ~119 IDs
194
- jobs = json.loads(http_get("https://hacker-news.firebaseio.com/v0/jobstories.json")) # ~31 IDs
195
-
196
- # Fetch a single item
197
- item = json.loads(http_get(
198
- "https://hacker-news.firebaseio.com/v0/item/47806725.json"
199
- ))
200
- # Fields: id, type, by, title, url, score, descendants (total comment count),
201
- # time (unix ts), kids (list of top-level comment IDs), text (self-post body)
202
-
203
- # Fetch a user profile
204
- user = json.loads(http_get(
205
- "https://hacker-news.firebaseio.com/v0/user/pg.json"
206
- ))
207
- # Fields: id, karma, created (unix ts), about (HTML), submitted (list of item IDs)
208
-
209
- # Highest current item ID (useful for polling new items)
210
- maxid = json.loads(http_get("https://hacker-news.firebaseio.com/v0/maxitem.json"))
211
- ```
212
-
213
- **Firebase vs Algolia tradeoff:**
214
- - Firebase `topstories` gives you 500 IDs in one call but then requires one HTTP call per item (~190ms each). Fetching all 500 items sequentially would take ~100 seconds.
215
- - Algolia returns full story data (title, points, author, comments) in one call for up to ~1000 results.
216
- - For "top 30 stories with full metadata": use `http_get` front page scrape (170ms total). For "top 500 stories with full metadata": use Algolia with `tags=front_page` or loop pages.
217
-
218
- ---
219
-
220
- ## Comment thread HTML (item page)
221
-
222
- For a large thread, the item page HTML (~1MB for 659 comments) loads ALL comments flat in a single request — no pagination, no JS required.
223
-
224
- ```python
225
- import re, html as htmllib
226
-
227
- page = http_get("https://news.ycombinator.com/item?id=47806725")
228
-
229
- # Count all comment IDs
230
- comment_ids = re.findall(r'<tr class="athing comtr" id="(\d+)">', page)
231
- # len(comment_ids) matches total comment count
232
-
233
- # Extract comment texts (careful: text spans multiple lines with <p> tags)
234
- # Use Algolia items API instead for structured access
235
- ```
236
-
237
- For structured comment access prefer Algolia items API — it returns a proper nested tree. The HTML item page is useful only when you need approximate comment count without an API call.
238
-
239
- ---
240
-
241
- ## Do NOT use a browser for HN
242
-
243
- All data is in plain HTML or JSON APIs. `goto_url()` + `wait_for_load()` takes 3–8 seconds; `http_get` takes 170–400ms. The JS `querySelectorAll` approach works (tested, returns correct data) but is 20–50x slower with no benefit.
1
+ # Hacker News — Data Extraction
2
+
3
+ `https://news.ycombinator.com` — YCombinator's link aggregator. Three access paths tested: `http_get` DOM scraping, Algolia search API, and the official HN Firebase API. All work without a browser.
4
+
5
+ ## Do this first: pick your access path
6
+
7
+ | Goal | Best approach | Latency |
8
+ |------|--------------|---------|
9
+ | Current front page (30 stories, real-time) | `http_get` + regex | ~170ms |
10
+ | Historical / keyword search | Algolia search API | ~400ms |
11
+ | Full comment tree (nested) | Algolia items API | ~300ms |
12
+ | Specific item by ID | Firebase API | ~200ms |
13
+ | 500 ranked story IDs | Firebase topstories | ~200ms (+ ~190ms/item after) |
14
+
15
+ **Never use a browser for read-only HN tasks.** Everything is accessible over HTTP with no auth, no JS rendering needed.
16
+
17
+ ---
18
+
19
+ ## Path 1: http_get front page (fastest for real-time data)
20
+
21
+ The front page HTML is ~34KB. Story order matches Firebase `/topstories.json` exactly — confirmed identical on 2026-04-18.
22
+
23
+ ```python
+ import re, html as htmllib
+
+ page = http_get("https://news.ycombinator.com")
+
+ # Extract all 30 story IDs (in rank order)
+ story_ids = re.findall(r'<tr class="athing submission" id="(\d+)">', page)
+
+ # Extract titles + URLs (same order as IDs)
+ titles_urls = re.findall(
+     r'class="titleline"[^>]*><a href="([^"]*)"[^>]*>(.*?)</a>', page
+ )
+
+ # Extract scores keyed by story ID (job posts have no score row);
+ # "points?" also matches the singular "1 point"
+ scores_by_id = {
+     m.group(1): int(m.group(2))
+     for m in re.finditer(
+         r'<span class="score" id="score_(\d+)">(\d+) points?</span>', page
+     )
+ }
+
+ # Extract authors keyed by story ID (anchor on score span)
+ authors_by_id = {}
+ for m in re.finditer(
+     r'<span class="score" id="score_(\d+)">\d+ points?</span>'
+     r'.*?class="hnuser">(.*?)</a>',
+     page, re.DOTALL
+ ):
+     authors_by_id[m.group(1)] = m.group(2)
+
+ # Extract comment counts keyed by story ID ("comments?" matches "1 comment")
+ comments_by_id = {
+     m.group(1): int(m.group(2))
+     for m in re.finditer(
+         r'href="item\?id=(\d+)">(\d+)&nbsp;comments?</a>', page
+     )
+ }
+
+ stories = []
+ for i, sid in enumerate(story_ids):
+     url, raw_title = titles_urls[i] if i < len(titles_urls) else ('', '')
+     stories.append({
+         'rank': i + 1,
+         'id': sid,
+         'title': htmllib.unescape(raw_title),  # MUST unescape: titles contain &#x27; etc.
+         'url': url,
+         'score': scores_by_id.get(sid),  # None for job posts
+         'author': authors_by_id.get(sid),
+         'comments': comments_by_id.get(sid, 0),
+     })
+ ```
74
+
75
+ **Gotchas:**
76
+ - Titles contain HTML entities (`&#x27;` `&amp;` `&quot;` `&gt;`). Always call `html.unescape()`.
77
+ - `<tr class="athing submission" id="...">` — the class is `athing submission`, not just `athing`. The `athing comtr` class is for comment rows.
78
+ - Job/hiring posts (YC ads) appear in the list but have no score or author. `scores_by_id.get(sid)` returns `None` for them — check before comparing.
79
+ - `re.DOTALL` multi-line patterns can cross story boundaries. Use ID-anchored patterns (as above) instead of positional zip for score/author.
80
+ - The bare URL serves only page 1 (30 items). Pages 2–4 are available at `?p=2` etc., but page 3 and beyond require a login cookie.
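
Because `score` is `None` for job posts, a direct comparison such as `story['score'] > 100` raises `TypeError` in Python 3. A minimal sketch of a None-safe filter, assuming the `stories` list of dicts built above; `filter_by_score` is a hypothetical helper name and the sample rows are hand-written, not real HN data:

```python
def filter_by_score(stories, min_score):
    """Return stories with at least min_score points, highest first.

    Job posts carry score=None and are skipped rather than compared.
    """
    scored = [s for s in stories if s.get("score") is not None]
    kept = [s for s in scored if s["score"] >= min_score]
    return sorted(kept, key=lambda s: s["score"], reverse=True)


# Hand-written sample rows in the same shape as `stories` (illustrative only)
sample = [
    {"id": "1", "title": "A", "score": 250},
    {"id": "2", "title": "Hiring post", "score": None},
    {"id": "3", "title": "B", "score": 40},
    {"id": "4", "title": "C", "score": 900},
]
top = filter_by_score(sample, 100)
```

Dropping the `None` rows first keeps the comparison and the sort key free of mixed types.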
81
+
82
+ ---
83
+
84
+ ## Path 2: Algolia search API (best for historical / keyword search)
85
+
86
+ No rate limiting was observed. A query returns up to 1000 hits (`hitsPerPage` is capped at ~1000 by the Algolia plan).
87
+
88
+ ```python
89
+ import json
90
+
91
+ # Keyword search — sorted by relevance
92
+ data = json.loads(http_get(
93
+ "https://hn.algolia.com/api/v1/search"
94
+ "?query=llm&tags=story&hitsPerPage=20"
95
+ ))
96
+
97
+ # Date-sorted (most recent first)
98
+ data = json.loads(http_get(
99
+ "https://hn.algolia.com/api/v1/search_by_date"
100
+ "?tags=story&hitsPerPage=20"
101
+ ))
102
+
103
+ # Paginate: add &page=N (0-indexed), up to data['nbPages']-1
104
+ ```
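
The `page=N` pagination noted above can be wrapped in a small generator. This is a sketch under assumptions: `search_url` and `iter_hits` are hypothetical helper names, `fetch` stands for any callable that takes a URL and returns the response body as a string (such as `http_get`), and `max_pages` is an arbitrary safety cap rather than an API limit.

```python
import json
from urllib.parse import urlencode

ALGOLIA_BASE = "https://hn.algolia.com/api/v1"


def search_url(endpoint="search", page=0, **params):
    """Build an Algolia HN URL; endpoint is 'search' or 'search_by_date'."""
    query = dict(params, page=page)
    return f"{ALGOLIA_BASE}/{endpoint}?{urlencode(query)}"


def iter_hits(fetch, endpoint="search", max_pages=5, **params):
    """Yield hits page by page, stopping at nbPages or max_pages."""
    page = 0
    while page < max_pages:
        data = json.loads(fetch(search_url(endpoint, page, **params)))
        yield from data.get("hits", [])
        page += 1
        # nbPages comes from the response; pages are 0-indexed
        if page >= data.get("nbPages", 0):
            break
```

A call might look like `for hit in iter_hits(http_get, query="llm", tags="story", hitsPerPage=100): ...`. Passing the fetcher in as an argument keeps the helper usable outside the agent workspace and easy to exercise with canned responses.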
105
+
106
+ **Fields returned per story hit:**
107
+ ```
108
+ objectID, title, url, author, points, num_comments,
109
+ created_at (ISO 8601), created_at_i (unix ts), story_id,
110
+ children (list of comment IDs — flat, not tree),
111
+ _tags, _highlightResult
112
+ ```
113
+
114
+ **Fields returned per comment hit:**
115
+ ```
116
+ objectID, comment_text, author, story_id, story_title, story_url,
117
+ parent_id, created_at, created_at_i, points
118
+ ```
119
+ Note: comment hits use `comment_text`, NOT `text`. Story hits use `story_text` for self-post body.
120
+
121
+ ### Tag filters
122
+
123
+ Tags are ANDed by default; wrap a group in parentheses to OR its members:
124
+
125
+ ```python
126
+ # Story types
127
+ "tags=story" # regular link/self posts
128
+ "tags=show_hn" # Show HN
129
+ "tags=ask_hn" # Ask HN
130
+ "tags=poll" # polls
131
+ "tags=job" # job posts
132
+
133
+ # Combined AND
134
+ "tags=story,front_page" # currently on front page
135
+ "tags=story,author_pg" # stories submitted by pg
136
+
137
+ # OR
138
+ "tags=(ask_hn,show_hn),story" # Ask OR Show HN
139
+
140
+ # By story ID (gets story + all its comments)
141
+ "tags=story_47806725"
142
+ ```
143
+
144
+ ### Numeric filters
145
+
146
+ ```python
147
+ # Date range (unix timestamps)
148
+ "numericFilters=created_at_i>1745000000"
149
+ "numericFilters=created_at_i>1700000000,created_at_i<1750000000"
150
+
151
+ # Point threshold
152
+ "numericFilters=points>100"
153
+ "numericFilters=points>500,points<1000"
154
+ ```
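
Since `created_at_i` is a unix timestamp, calendar dates need converting before they go into `numericFilters`. A minimal sketch, assuming naive `datetime` inputs that are meant as UTC; `created_between` is a hypothetical helper name, and the filter is percent-encoded because `>`, `<` and `,` are not URL-safe:

```python
from datetime import datetime, timezone
from urllib.parse import quote


def created_between(start, end):
    """Return a numericFilters value for start < created_at_i < end.

    Assumes start and end are naive datetimes interpreted as UTC.
    """
    lo = int(start.replace(tzinfo=timezone.utc).timestamp())
    hi = int(end.replace(tzinfo=timezone.utc).timestamp())
    return f"created_at_i>{lo},created_at_i<{hi}"


flt = created_between(datetime(2024, 1, 1), datetime(2024, 2, 1))
url = (
    "https://hn.algolia.com/api/v1/search_by_date"
    "?tags=story&numericFilters=" + quote(flt)
)
```

The bounds are exclusive, matching the `>` and `<` operators used in the examples above.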
155
+
156
+ ### Full Algolia items API (nested comment tree)
157
+
158
+ ```python
+ import json
+
+ thread = json.loads(http_get(
+     "https://hn.algolia.com/api/v1/items/47806725"
+ ))
+ # thread['children'] = list of top-level comment objects
+ # Each comment: author, text (HTML), created_at, children (nested replies)
+ # Recursively walk children for full thread
+
+ # Total comment count (iterative walk with an explicit stack):
+ stack = list(thread.get('children', []))
+ total = 0
+ while stack:
+     node = stack.pop()
+     total += 1
+     stack.extend(node.get('children', []))
+ ```
176
+
177
+ Confirmed: Algolia items returns 653 total comments for a 659-comment thread (some deleted). `text` field in items API is HTML with `<p>` tags and `<a>` links — may need to strip tags.
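
One way to flatten the nested tree and strip the HTML from `text` is sketched below. `to_plain_text` and `walk_comments` are hypothetical helper names, the regex tag stripping is deliberately naive (a real HTML parser is safer for unusual markup), and the sample thread is hand-written. Treating a missing `text` as an empty string is an assumption about how deleted comments appear.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def to_plain_text(fragment):
    """Turn a comment's HTML into plain text; <p> becomes a blank line."""
    text = re.sub(r"<p>", "\n\n", fragment or "")
    # Strip tags before unescaping so escaped "&lt;b&gt;" is not removed
    text = TAG_RE.sub("", text)
    return html.unescape(text).strip()


def walk_comments(node, depth=0):
    """Yield (depth, author, plain_text) for every comment below node."""
    for child in node.get("children", []):
        yield depth, child.get("author"), to_plain_text(child.get("text"))
        yield from walk_comments(child, depth + 1)


# Hand-written sample in the shape described above (illustrative only)
sample_thread = {
    "children": [
        {
            "author": "alice",
            "text": 'First<p>Second &amp; <a href="https://example.com">link</a>',
            "children": [{"author": "bob", "text": "<i>Reply</i>", "children": []}],
        },
        {"author": None, "text": None, "children": []},
    ]
}
rows = list(walk_comments(sample_thread))
```

The depth value makes it straightforward to indent replies when printing a thread.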
178
+
179
+ ---
180
+
181
+ ## Path 3: Official HN Firebase API
182
+
183
+ Clean JSON, no scraping. Use for fetching specific items or building live feeds.
184
+
185
+ ```python
186
+ import json
187
+
188
+ # Ranked story ID lists (no metadata — just IDs)
189
+ top = json.loads(http_get("https://hacker-news.firebaseio.com/v0/topstories.json")) # 500 IDs
190
+ new = json.loads(http_get("https://hacker-news.firebaseio.com/v0/newstories.json")) # 500 IDs
191
+ best = json.loads(http_get("https://hacker-news.firebaseio.com/v0/beststories.json")) # 200 IDs
192
+ ask = json.loads(http_get("https://hacker-news.firebaseio.com/v0/askstories.json")) # ~32 IDs
193
+ show = json.loads(http_get("https://hacker-news.firebaseio.com/v0/showstories.json")) # ~119 IDs
194
+ jobs = json.loads(http_get("https://hacker-news.firebaseio.com/v0/jobstories.json")) # ~31 IDs
195
+
196
+ # Fetch a single item
197
+ item = json.loads(http_get(
198
+ "https://hacker-news.firebaseio.com/v0/item/47806725.json"
199
+ ))
200
+ # Fields: id, type, by, title, url, score, descendants (total comment count),
201
+ # time (unix ts), kids (list of top-level comment IDs), text (self-post body)
202
+
203
+ # Fetch a user profile
204
+ user = json.loads(http_get(
205
+ "https://hacker-news.firebaseio.com/v0/user/pg.json"
206
+ ))
207
+ # Fields: id, karma, created (unix ts), about (HTML), submitted (list of item IDs)
208
+
209
+ # Highest current item ID (useful for polling new items)
210
+ maxid = json.loads(http_get("https://hacker-news.firebaseio.com/v0/maxitem.json"))
211
+ ```
212
+
213
+ **Firebase vs Algolia tradeoff:**
214
+ - Firebase `topstories` gives you 500 IDs in one call but then requires one HTTP call per item (~190ms each). Fetching all 500 items sequentially would take ~100 seconds.
215
+ - Algolia returns full story data (title, points, author, comments) in one call for up to ~1000 results.
216
+ - For "top 30 stories with full metadata": use `http_get` front page scrape (170ms total). For "top 500 stories with full metadata": use Algolia with `tags=front_page` or loop pages.
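
The sequential estimate follows from 500 items × ~190ms ≈ 95 seconds. If per-item Firebase fetches are unavoidable, issuing them concurrently should cut the wall-clock time. A minimal sketch, assuming `fetch` is a thread-safe callable that takes a URL and returns the body as a string (whether `http_get` is thread-safe is not confirmed here); `fetch_items` is a hypothetical helper name and the worker count is arbitrary:

```python
import json
from concurrent.futures import ThreadPoolExecutor

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_items(fetch, ids, workers=16):
    """Fetch Firebase items concurrently, preserving the order of ids."""
    def one(item_id):
        return json.loads(fetch(ITEM_URL.format(item_id)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so rank order survives
        return list(pool.map(one, ids))
```

For example, `fetch_items(http_get, top[:100])` would fetch the first 100 ranked stories. Keep the worker count modest to stay polite to the API.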
217
+
218
+ ---
219
+
220
+ ## Comment thread HTML (item page)
221
+
222
+ For a large thread, the item page HTML (~1MB for 659 comments) loads ALL comments flat in a single request — no pagination, no JS required.
223
+
224
+ ```python
225
+ import re, html as htmllib
226
+
227
+ page = http_get("https://news.ycombinator.com/item?id=47806725")
228
+
229
+ # Count all comment IDs
230
+ comment_ids = re.findall(r'<tr class="athing comtr" id="(\d+)">', page)
231
+ # len(comment_ids) matches total comment count
232
+
233
+ # Extract comment texts (careful: text spans multiple lines with <p> tags)
234
+ # Use Algolia items API instead for structured access
235
+ ```
236
+
237
+ For structured comment access, prefer the Algolia items API, which returns a proper nested tree. The HTML item page is useful only when you need an approximate comment count without calling a JSON API.
238
+
239
+ ---
240
+
241
+ ## Do NOT use a browser for HN
242
+
243
+ All data is in plain HTML or JSON APIs. `goto_url()` + `wait_for_load()` takes 3–8 seconds; `http_get` takes 170–400ms. The JS `querySelectorAll` approach works (tested, returns correct data) but is 20–50x slower with no benefit.