@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. /package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. /package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,435 +1,435 @@
1
- # Stack Overflow — Scraping & Data Extraction
2
-
3
- `https://stackoverflow.com` — all public read-only data is available via the Stack Exchange API v2.3. No auth, no browser required for any read operation. API is fast, returns gzip-compressed JSON, and works transparently with `http_get`.
4
-
5
- ## Do this first: pick your access path
6
-
7
- | Goal | Best approach | Notes |
8
- |------|--------------|-------|
9
- | Top/hot questions by tag | `GET /2.3/questions` | Add `filter=withbody` for question text |
10
- | Answers for a question | `GET /2.3/questions/{id}/answers` | Add `filter=withbody` for answer text |
11
- | Search by keyword + tag | `GET /2.3/search/advanced` | More filters than `/search` |
12
- | Simple title keyword search | `GET /2.3/search` | `intitle=` param |
13
- | Fetch by known question IDs | `GET /2.3/questions/{id1};{id2};...` | Semicolon-delimited batch, up to 100 |
14
- | User profile + reputation | `GET /2.3/users/{id}` | Public fields only |
15
- | User activity timeline | `GET /2.3/users/{id}/timeline` | Events: badges, answers, questions |
16
- | User's questions / answers | `GET /2.3/users/{id}/questions` or `/answers` | Standard listing |
17
- | Comments on a post | `GET /2.3/questions/{id}/comments` | Needs `filter=withbody` for body |
18
- | Related questions | `GET /2.3/questions/{id}/related` | Returns linked/similar questions |
19
- | Answer by ID directly | `GET /2.3/answers/{id}` | One or more semicolon-separated IDs |
20
- | Popular tags | `GET /2.3/tags` | Sort by `popular`, `activity`, or `name` |
21
- | Site-wide statistics | `GET /2.3/info` | Total questions, quota, etc. |
22
- | Question HTML page | `http_get` with User-Agent | Returns 777KB HTML; prefer API |
23
-
24
- **Use the API for all data tasks.** The HTML page is 777KB, lacks clean structure, and the JSON-LD block only contains `WebSite` and `Organization` objects (no `QAPage` or `Question` schema). The API returns the same data in milliseconds, fully structured.
25
-
26
- ---
27
-
28
- ## Quota limits
29
-
30
- The API is unauthenticated-friendly but strictly quota-capped per IP per day:
31
-
32
- | Auth level | Daily quota | Burst |
33
- |------------|-------------|-------|
34
- | No key (unauthenticated) | **300 requests/day** | No enforced burst limit observed |
35
- | With API key | **10,000 requests/day** | Same |
36
-
37
- Check your remaining quota in every response envelope:
38
-
39
- ```python
40
- import json
41
- data = json.loads(http_get("https://api.stackexchange.com/2.3/info?site=stackoverflow"))
42
- print("Quota remaining:", data.get('quota_remaining')) # e.g. 273
43
- print("Quota max:", data.get('quota_max')) # 300 unauthenticated, 10000 with key
44
- # Confirmed: quota_max=300, quota_remaining decrements per call
45
- ```
46
-
47
- Every API response includes `quota_remaining` in the envelope. Monitor it. When it hits 0, all calls return HTTP 400 with `error_id: 502` (throttle_violation). There is no retry-after header — wait until midnight UTC.
48
-
49
- **If you have an API key**, append `&key=YOUR_KEY` to any URL to use the 10,000/day quota.
50
-
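The key handling described in the line above can be sketched as a small URL builder. This is an illustration under assumptions, not code from the diffed file: `build_url` and `API_ROOT` are hypothetical names, and `YOUR_KEY` is the same placeholder the text uses. The resulting string would be passed to whatever fetch helper is in use (the file uses `http_get`).

```python
from urllib.parse import urlencode

# Base URL taken from the examples in the diffed file.
API_ROOT = "https://api.stackexchange.com/2.3"


def build_url(path, api_key=None, **params):
    """Build a Stack Exchange API URL, appending `key=` only when a key is given.

    `build_url` is a hypothetical helper, not part of the package.
    """
    query = {"site": "stackoverflow", **params}
    if api_key:
        query["key"] = api_key  # switches the call to the keyed quota
    return f"{API_ROOT}/{path.lstrip('/')}?{urlencode(query)}"


print(build_url("info"))
print(build_url("questions", api_key="YOUR_KEY", pagesize=1))
```

Keeping the key out of the default query means unauthenticated calls produce exactly the URLs shown elsewhere in the file.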
51
- ---
52
-
53
- ## Response envelope
54
-
55
- Every response from the Stack Exchange API is wrapped in a consistent envelope:
56
-
57
- ```python
58
- {
59
-     "items": [...],  # list of result objects
60
-     "has_more": True/False,  # whether more pages exist
61
-     "quota_max": 300,  # total daily quota
62
-     "quota_remaining": 273,  # calls left today
63
-     "backoff": None  # seconds to wait before next call (rare)
64
- }
65
- ```
66
-
67
- Always check `data.get('backoff')` — if it returns an integer, sleep that many seconds before the next call. Ignoring it causes throttle errors.
68
-
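The backoff rule above can be sketched as a helper that inspects a parsed envelope before the next call. This is a minimal sketch under assumptions, not code from the package: `respect_backoff` and `quota_exhausted` are hypothetical names, and the injectable `sleep` argument exists only so the helper can be exercised without actually waiting.

```python
import time


def respect_backoff(envelope, sleep=time.sleep):
    """Sleep for the envelope's `backoff` value, if one is present.

    Returns the number of seconds waited (0 when no backoff was requested).
    Hypothetical helper; `envelope` is the dict parsed from an API response.
    """
    backoff = envelope.get("backoff")
    if backoff:  # None, a missing key, and 0 all mean "no wait needed"
        seconds = int(backoff)
        sleep(seconds)
        return seconds
    return 0


def quota_exhausted(envelope):
    """True when the envelope reports no calls left for today."""
    return envelope.get("quota_remaining", 1) <= 0


# Usage with a hand-written envelope and a fake sleep that only records the wait.
waits = []
respect_backoff({"items": [], "backoff": 5, "quota_remaining": 10}, sleep=waits.append)
print(waits)
```

Calling the helper after every response keeps the wait logic in one place instead of scattering `time.sleep` calls across workflows.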
69
- Error responses raise `urllib.error.HTTPError` (not a JSON envelope):
70
- - HTTP 400 — invalid parameter (e.g. bad site name) — raises exception
71
- - HTTP 400 with JSON body — quota exhausted or throttle_violation
72
-
73
- ```python
74
- try:
75
-     data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?site=stackoverflow&pagesize=1"))
76
- except Exception as e:
77
-     print("API error:", e) # HTTPError HTTP Error 400: Bad Request
78
- ```
79
-
80
- ---
81
-
82
- ## `filter=withbody` — required for post content
83
-
84
- By default, the API strips the `body` field from all responses. You **must** add `filter=withbody` to get question or answer text. This applies to questions, answers, and comments alike.
85
-
86
- ```python
87
- import json
88
-
89
- # WITHOUT filter=withbody — body field is ABSENT
90
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1"))
91
- q = data['items'][0]
92
- print("Has body:", 'body' in q) # False
93
- print("Keys:", sorted(q.keys()))
94
- # ['accepted_answer_id', 'answer_count', 'content_license', 'creation_date',
95
- # 'is_answered', 'last_activity_date', 'last_edit_date', 'link', 'owner',
96
- # 'protected_date', 'question_id', 'score', 'tags', 'title', 'view_count']
97
-
98
- # WITH filter=withbody — body field is PRESENT
99
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1&filter=withbody"))
100
- q = data['items'][0]
101
- print("Has body:", 'body' in q) # True
102
- print("Body preview:", q['body'][:60])
103
- # '<p>What functionality does the <a href="https://do...'
104
- ```
105
-
106
- ---
107
-
108
- ## HTML encoding in API responses
109
-
110
- The API returns HTML in two contexts, and plain text in a third:
111
-
112
- - **`body` field** (questions, answers, comments) — full HTML markup. Headings, code blocks, links, blockquotes, lists. Strip with `html.parser` for plain text.
113
- - **`title` field** — HTML-entity-encoded plain text. Quotes, angle brackets, and ampersands are escaped (`&quot;`, `&lt;`, `&amp;`). Decode with `html.unescape()`.
114
- - **`display_name`, `link`, `tags`** — plain text, no encoding.
115
-
116
- ```python
117
- import json, html
118
- from html.parser import HTMLParser
119
-
120
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions/231767?site=stackoverflow&filter=withbody"))
121
- q = data['items'][0]
122
-
123
- # Title has HTML entities
124
- print("Raw title:", q['title'])
125
- # 'What does the &quot;yield&quot; keyword do in Python?'
126
- print("Decoded:", html.unescape(q['title']))
127
- # 'What does the "yield" keyword do in Python?'
128
-
129
- # Body is full HTML — strip for plain text
130
- class Stripper(HTMLParser):
131
-     def __init__(self):
132
-         super().__init__()
133
-         self.text = []
134
-     def handle_data(self, d):
135
-         self.text.append(d)
136
-     def get_text(self):
137
-         return ''.join(self.text)
138
-
139
- s = Stripper()
140
- s.feed(q['body'])
141
- print(s.get_text()[:200])
142
- # 'What functionality does the yield keyword do in Python?\nWhat is the ...'
143
- ```
144
-
145
- ---
146
-
147
- ## Common workflows
148
-
149
- ### Top questions by tag (API)
150
-
151
- ```python
152
- import json, html
153
- data = json.loads(http_get(
154
-     "https://api.stackexchange.com/2.3/questions"
155
-     "?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=5&filter=withbody"
156
- ))
157
- for q in data['items']:
158
-     print(q['question_id'], q['score'], html.unescape(q['title'])[:60])
159
-     print(" Tags:", q['tags'][:3], "Answers:", q['answer_count'])
160
- print("Quota remaining:", data.get('quota_remaining'))
161
- # 231767 13133 What does the "yield" keyword do in Python?
162
- # Tags: ['python', 'iterator', 'generator'] Answers: 51
163
- # 419163 8438 What does if __name__ == "__main__": do?
164
- # Tags: ['python', 'namespaces', 'program-entry-point'] Answers: 40
165
- # Quota remaining: 299
166
- ```
167
-
168
- Sort options for `/questions`: `activity`, `votes`, `creation`, `hot`, `week`, `month`.
169
-
170
- ### Answers for a question
171
-
172
- ```python
173
- import json
174
- data = json.loads(http_get(
175
-     "https://api.stackexchange.com/2.3/questions/231767/answers"
176
-     "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=3"
177
- ))
178
- for a in data['items']:
179
-     print(f"Score: {a['score']}, Accepted: {a.get('is_accepted')}")
180
-     print(f" Body preview: {a['body'][:150]}")
181
- # Score: 18307, Accepted: True
182
- # Body preview: <p>To understand what <a href="...">yield</a> does, ...
183
- # Score: 2596, Accepted: False
184
- # Score: 802, Accepted: False
185
- ```
186
-
187
- Answer fields (with `filter=withbody`): `answer_id`, `question_id`, `score`, `is_accepted`, `body`, `owner`, `creation_date`, `last_activity_date`, `content_license`.
188
-
189
- ### Fetch questions by ID (batch)
190
-
191
- Fetch up to 100 questions in one call using semicolons:
192
-
193
- ```python
194
- import json
195
- data = json.loads(http_get(
196
-     "https://api.stackexchange.com/2.3/questions/231767;419163;394809"
197
-     "?site=stackoverflow&filter=withbody"
198
- ))
199
- print("Fetched:", len(data['items'])) # 3
200
- for q in data['items']:
201
-     print(q['question_id'], q['score'], q['title'][:50])
202
- # 231767 13133 What does the &quot;yield&quot; keyword do in Pyth
203
- # 419163 8438 What does if __name__ == &quot;__main__&quot;: do?
204
- # 394809 8125 Does Python have a ternary conditional operator?
205
- ```
206
-
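When there are more IDs than fit in one call, the semicolon batching shown above can be applied in chunks. A minimal sketch, assuming the 100-ID ceiling stated in the file; `batch_id_paths` is a hypothetical helper name and the code only builds the path segments, it does not fetch anything.

```python
def batch_id_paths(question_ids, batch_size=100):
    """Split IDs into semicolon-delimited groups of at most `batch_size`.

    Hypothetical helper: each returned string can replace the ID part of
    /2.3/questions/{ids} in a request URL.
    """
    ids = [str(i) for i in question_ids]
    return [
        ";".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


# The three IDs from the example above fit in a single batch.
print(batch_id_paths([231767, 419163, 394809]))
```

Each batch costs one request against the daily quota, so grouping IDs this way uses far fewer calls than fetching questions one at a time.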
207
- ### Search — `search/advanced` vs `search`
208
-
209
- Use `/search/advanced` when you need combined keyword + tag filtering. Use `/search` when searching only by title keyword (`intitle=`).
210
-
211
- ```python
212
- import json
213
-
214
- # search/advanced: keyword in body OR title, filtered by tag, sorted by relevance
215
- data = json.loads(http_get(
216
-     "https://api.stackexchange.com/2.3/search/advanced"
217
-     "?q=asyncio+event+loop&tagged=python&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
218
- ))
219
- for q in data['items']:
220
-     print(q['score'], q['answer_count'], q['title'][:70])
221
- # 137 3 "Asyncio Event Loop is Closed" when getting loop
222
- # 47 3 Can an asyncio event loop run in the background without suspending the
223
-
224
- # search: title-only keyword search via intitle=
225
- data = json.loads(http_get(
226
-     "https://api.stackexchange.com/2.3/search"
227
-     "?intitle=asyncio+event+loop&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
228
- ))
229
- ```
230
-
231
- `search/advanced` additional params: `accepted=True` (only questions with accepted answers), `answers=1` (minimum answer count), `body=` (keyword in body), `user=` (filter by owner user ID), `views=` (minimum view count), `fromdate=`/`todate=` (Unix timestamps).
232
-
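The extra `search/advanced` parameters listed above combine into one query string. The sketch below is illustrative only: `to_unix` is a hypothetical helper, the date range and parameter values are made up for the example, and the URL is built but not fetched.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def to_unix(year, month, day):
    """Convert a UTC calendar date to the Unix timestamp fromdate/todate expect."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


params = {
    "q": "asyncio event loop",       # free-text keyword search
    "tagged": "python",
    "accepted": "True",              # only questions with an accepted answer
    "answers": 1,                    # minimum answer count
    "fromdate": to_unix(2020, 1, 1),
    "todate": to_unix(2021, 1, 1),
    "site": "stackoverflow",
    "pagesize": 5,
    "order": "desc",
    "sort": "relevance",
}
url = "https://api.stackexchange.com/2.3/search/advanced?" + urlencode(params)
print(url)
```

`urlencode` turns the spaces in `q` into `+`, which matches the hand-written query strings used in the examples above.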
233
- ### User profile
234
-
235
- ```python
236
- import json
237
-
238
- # Basic user info
239
- user = json.loads(http_get("https://api.stackexchange.com/2.3/users/1?site=stackoverflow"))
240
- u = user['items'][0]
241
- print("User:", u['display_name'], "Rep:", u['reputation'], "Badges:", u['badge_counts'])
242
- # User: Jeff Atwood Rep: 64159 Badges: {'bronze': 153, 'silver': 153, 'gold': 48}
243
-
244
- # Fields: user_id, display_name, reputation, badge_counts, location, link,
245
- # creation_date, last_access_date, is_employee, account_id,
246
- # accept_rate, profile_image, website_url
247
-
248
- # Timeline (badge, question, answer events)
249
- data = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/timeline?site=stackoverflow&pagesize=5"))
250
- print("Event types:", set(i['timeline_type'] for i in data['items']))
251
- # {'badge'}
252
-
253
- # User's top answers
254
- answers = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/answers?site=stackoverflow&pagesize=5&order=desc&sort=votes"))
255
- for a in answers['items']:
255
-     print("Score:", a['score'], "Question ID:", a.get('question_id'))
257
-
258
- # User's questions
259
- questions = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/questions?site=stackoverflow&pagesize=3&order=desc&sort=votes"))
260
- for q in questions['items']:
261
- print(q['question_id'], q['score'], q['title'][:60])
262
- # 9 2273 How do I calculate someone&#39;s age based on a DateTime typ
263
- # 11 1656 Calculate relative time in C#
264
- ```
265
-
266
- ### Comments (requires `filter=withbody`)
267
-
268
- ```python
269
- import json
270
- data = json.loads(http_get(
271
- "https://api.stackexchange.com/2.3/questions/231767/comments"
272
- "?site=stackoverflow&pagesize=5&order=desc&sort=creation&filter=withbody"
273
- ))
274
- for c in data['items']:
275
- print("Score:", c['score'], "Body:", c.get('body','')[:80])
276
- # Comment keys (without filter): comment_id, content_license, creation_date,
277
- # edited, owner, post_id, reply_to_user, score
278
- # With filter=withbody: adds 'body' field (HTML-encoded)
279
- ```
280
-
281
- ### Related questions
282
-
283
- ```python
284
- import json
285
- related = json.loads(http_get(
286
- "https://api.stackexchange.com/2.3/questions/231767/related?site=stackoverflow&pagesize=5"
287
- ))
288
- for q in related['items']:
289
- print(q['question_id'], q['score'], q['title'][:60])
290
- # 25232350 15 how generators work in python
291
- # 28880095 11 What does a plain yield keyword do in Python?
292
- ```
293
-
294
- ### Popular tags
295
-
296
- ```python
297
- import json
298
- tags = json.loads(http_get("https://api.stackexchange.com/2.3/tags?order=desc&sort=popular&site=stackoverflow&pagesize=5"))
299
- for t in tags['items']:
300
- print(f"{t['name']}: {t['count']:,} questions")
301
- # javascript: 2,531,995 questions
302
- # java: 1,921,907 questions
303
- # c#: 1,626,728 questions
304
- # python: (check live — grows daily)
305
- ```
306
-
307
- ---
308
-
309
- ## Pagination
310
-
311
- Use `page=` (1-indexed) and `pagesize=` (max 100). Check `has_more` in the envelope to know whether a next page exists.
312
-
313
- ```python
314
- import json
315
-
316
- def fetch_all_pages(url_base, max_pages=5):
317
- """Fetch multiple pages from any Stack Exchange API endpoint."""
318
- results = []
319
- for page in range(1, max_pages + 1):
320
- data = json.loads(http_get(f"{url_base}&page={page}"))
321
- results.extend(data['items'])
322
- if not data.get('has_more'):
323
- break
324
- if data.get('backoff'):
325
- import time; time.sleep(data['backoff'])
326
- return results
327
-
328
- questions = fetch_all_pages(
329
- "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes"
330
- "&tagged=python&site=stackoverflow&pagesize=10",
331
- max_pages=3
332
- )
333
- print("Total fetched:", len(questions)) # up to 30
334
- ```
335
-
336
- Note: `page=2` with `pagesize=3` returns the 4th–6th items. Confirmed working — `has_more: True` on page 2 of top Python questions.
337
-
338
- ---
339
-
340
- ## Parallel fetching (multiple questions or answers)
341
-
342
- ```python
343
- import json
344
- from concurrent.futures import ThreadPoolExecutor
345
-
346
- def fetch_top_answer(qid):
347
- data = json.loads(http_get(
348
- f"https://api.stackexchange.com/2.3/questions/{qid}/answers"
349
- "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=1"
350
- ))
351
- if data['items']:
352
- a = data['items'][0]
353
- return {"qid": qid, "top_score": a['score'], "accepted": a.get('is_accepted')}
354
- return {"qid": qid, "top_score": 0}
355
-
356
- qids = [231767, 419163, 394809, 100003, 82831]
357
- with ThreadPoolExecutor(max_workers=3) as ex:
358
- results = list(ex.map(fetch_top_answer, qids))
359
-
360
- for r in results:
361
- print(r)
362
- # {'qid': 231767, 'top_score': 18307, 'accepted': True}
363
- # {'qid': 419163, 'top_score': 9051, 'accepted': True}
364
- # {'qid': 394809, 'top_score': 9355, 'accepted': True}
365
- # {'qid': 100003, 'top_score': 9334, 'accepted': False}
366
- # {'qid': 82831, 'top_score': 6793, 'accepted': False}
367
- ```
368
-
369
- Keep `max_workers` at 3 or below when unauthenticated — parallel calls consume quota simultaneously. At 3 workers, 5 questions used 5 quota units (expected).
370
-
371
- ---
372
-
373
- ## HTML page scraping (avoid for data tasks)
374
-
375
- The HTML page works but returns 777KB and has no clean `QAPage` JSON-LD. Use it only when you need something not in the API (e.g. rendered MathJax, ads context).
376
-
377
- ```python
378
- import re, html as htmllib
379
- headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
380
- page = http_get("https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python", headers=headers)
381
- print("HTML length:", len(page)) # 777138
382
-
383
- # Page title (includes site suffix)
384
- title_m = re.search(r'<title>([^<]+)</title>', page)
385
- if title_m:
386
- print(htmllib.unescape(title_m.group(1)))
387
- # 'iterator - What does the "yield" keyword do in Python? - Stack Overflow'
388
-
389
- # Answer count via itemprop
390
- ans_count = re.search(r'itemprop="answerCount"[^>]*>(\d+)<', page)
391
- if ans_count:
392
- print("Answers:", ans_count.group(1)) # '51'
393
-
394
- # Score via itemprop (has whitespace around number)
395
- score_m = re.search(r'itemprop="upvoteCount"[^>]*>\s*(-?\d+)\s*<', page)
396
- if score_m:
397
- print("Score:", score_m.group(1)) # '13133'
398
-
399
- # JSON-LD is present but only has WebSite and Organization — NOT QAPage/Question
400
- ld_match = re.search(r'<script type="application/ld\+json">(.*?)</script>', page, re.DOTALL)
401
- if ld_match:
402
- d = json.loads(ld_match.group(1))
403
- types = [item.get('@type') for item in d.get('@graph', [])]
404
- print("JSON-LD types:", types) # ['WebSite', 'Organization'] — no QAPage
405
- ```
406
-
407
- ---
408
-
409
- ## Gotchas
410
-
411
- - **300 req/day unauthenticated is per IP, resets at midnight UTC.** 6 tests consumed ~27 quota units in one session. With parallel workers and loops, you can burn through 300 in minutes. Always check `quota_remaining` in responses.
412
-
413
- - **`filter=withbody` is required for body content.** Without it, `body` is simply absent from the response — no error, no empty string, just a missing key. Applies to questions, answers, AND comments.
414
-
415
- - **Title field has HTML entities, body field has full HTML markup.** They need different decoding strategies: `html.unescape()` for titles, `HTMLParser` stripping for bodies. Don't confuse them.
416
-
417
- - **Titles in API responses contain `&quot;`, `&lt;`, `&amp;`, `&#39;`** — raw output is `What does the &quot;yield&quot; keyword do in Python?`. Always call `html.unescape()` before displaying or comparing.
418
-
419
- - **Batch IDs with semicolons, not commas.** `/questions/231767;419163;394809` fetches 3 questions in one API call. Using commas returns a 400 error.
420
-
421
- - **`search/advanced` includes body text in results; `/search` only searches titles.** Use `search/advanced` with `q=` for full-text search. Use `/search` with `intitle=` for title-only.
422
-
423
- - **HTTP errors are raised as exceptions, not returned as JSON.** A bad `site=` param causes `urllib.error.HTTPError: HTTP Error 400: Bad Request` — there's no JSON body accessible from `http_get`. Wrap API calls in try/except.
424
-
425
- - **`backoff` in the response envelope must be respected.** If `data.get('backoff')` returns an integer (rare, typically 10–30 seconds), sleep that duration before the next call. Ignoring it will cause throttle errors on subsequent requests.
426
-
427
- - **`/info` endpoint wraps stats inside `items[0]`**, not directly in the envelope. Access as `data['items'][0]['total_questions']`.
428
-
429
- - **JSON-LD on the HTML page is NOT QAPage schema.** The `<script type="application/ld+json">` block only contains `WebSite` and `Organization` objects in the `@graph` array. There is no `Question`, `Answer`, or `QAPage` type — confirmed on the most-voted Python question (231767). Don't rely on structured data from the HTML page.
430
-
431
- - **User timeline `timeline_type` can be `badge`, `question`, `answer`, `comment`, `revision`, `suggested_edit`, `accepted`.** For very old/inactive users, all recent events may be `badge` only.
432
-
433
- - **Multi-site support.** Change `site=stackoverflow` to any Stack Exchange site: `site=superuser`, `site=serverfault`, `site=askubuntu`, `site=unix`, `site=datascience`, `site=math`. Same API, same quota pool per IP.
434
-
435
- - **`pagesize` max is 100.** Requesting more returns a 400 error. For bulk fetching, loop with `page=` and check `has_more`.
1
+ # Stack Overflow — Scraping & Data Extraction
2
+
3
+ `https://stackoverflow.com` — all public read-only data is available via the Stack Exchange API v2.3. No auth, no browser required for any read operation. API is fast, returns gzip-compressed JSON, and works transparently with `http_get`.
4
+
5
+ ## Do this first: pick your access path
6
+
7
+ | Goal | Best approach | Notes |
8
+ |------|--------------|-------|
9
+ | Top/hot questions by tag | `GET /2.3/questions` | Add `filter=withbody` for question text |
10
+ | Answers for a question | `GET /2.3/questions/{id}/answers` | Add `filter=withbody` for answer text |
11
+ | Search by keyword + tag | `GET /2.3/search/advanced` | More filters than `/search` |
12
+ | Simple title keyword search | `GET /2.3/search` | `intitle=` param |
13
+ | Fetch by known question IDs | `GET /2.3/questions/{id1};{id2};...` | Semicolon-delimited batch, up to 100 |
14
+ | User profile + reputation | `GET /2.3/users/{id}` | Public fields only |
15
+ | User activity timeline | `GET /2.3/users/{id}/timeline` | Events: badges, answers, questions |
16
+ | User's questions / answers | `GET /2.3/users/{id}/questions` or `/answers` | Standard listing |
17
+ | Comments on a post | `GET /2.3/questions/{id}/comments` | Needs `filter=withbody` for body |
18
+ | Related questions | `GET /2.3/questions/{id}/related` | Returns linked/similar questions |
19
+ | Answer by ID directly | `GET /2.3/answers/{id}` | One or more semicolon-separated IDs |
20
+ | Popular tags | `GET /2.3/tags` | Sort by `popular`, `activity`, or `name` |
21
+ | Site-wide statistics | `GET /2.3/info` | Total questions, quota, etc. |
22
+ | Question HTML page | `http_get` with User-Agent | Returns 777KB HTML; prefer API |
23
+
24
+ **Use the API for all data tasks.** The HTML page is 777KB, lacks clean structure, and the JSON-LD block only contains `WebSite` and `Organization` objects (no `QAPage` or `Question` schema). The API returns the same data in milliseconds, fully structured.
25
+
26
+ ---
27
+
28
+ ## Quota limits
29
+
30
+ The API is unauthenticated-friendly but strictly quota-capped per IP per day:
31
+
32
+ | Auth level | Daily quota | Burst |
33
+ |------------|-------------|-------|
34
+ | No key (unauthenticated) | **300 requests/day** | No enforced burst limit observed |
35
+ | With API key | **10,000 requests/day** | Same |
36
+
37
+ Check your remaining quota in every response envelope:
38
+
39
+ ```python
40
+ import json
41
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/info?site=stackoverflow"))
42
+ print("Quota remaining:", data.get('quota_remaining')) # e.g. 273
43
+ print("Quota max:", data.get('quota_max')) # 300 unauthenticated, 10000 with key
44
+ # Confirmed: quota_max=300, quota_remaining decrements per call
45
+ ```
46
+
47
+ Every API response includes `quota_remaining` in the envelope. Monitor it. When it hits 0, all calls return HTTP 400 with `error_id: 502` (throttle_violation). There is no retry-after header — wait until midnight UTC.
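Since the text above says the only remedy for an exhausted quota is to wait for midnight UTC, the wait can be computed rather than guessed. This is a minimal sketch using only the standard library; `seconds_until_utc_midnight` is a hypothetical helper name, and the midnight-UTC reset is taken from this document rather than verified independently.

```python
from datetime import datetime, timedelta, timezone

def seconds_until_utc_midnight(now=None):
    """Return the number of seconds from `now` until the next 00:00 UTC."""
    # Default to the current time; callers should pass a timezone-aware datetime.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()

# Fixed example: half an hour before midnight UTC.
example = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
wait_s = seconds_until_utc_midnight(example)
print(wait_s)  # 1800.0
```

A caller could pass the result to `time.sleep()` after seeing `error_id: 502`, or simply log it and stop.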
48
+
49
+ **If you have an API key**, append `&key=YOUR_KEY` to any URL to use the 10,000/day quota.
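One way to avoid hand-appending `&key=...` to every URL is a small builder that adds the key only when one is configured. The sketch below is an illustration, not part of the skill: `build_url` and `API_ROOT` are hypothetical names, and it defaults `site` to `stackoverflow` as an assumption.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"

def build_url(path, key=None, **params):
    """Build an API URL, appending `key` only when one is supplied."""
    params.setdefault("site", "stackoverflow")  # assumed default site
    if key:
        params["key"] = key
    return f"{API_ROOT}/{path.lstrip('/')}?{urlencode(params)}"

url = build_url("questions", key="YOUR_KEY", tagged="python", pagesize=5)
print(url)
# https://api.stackexchange.com/2.3/questions?tagged=python&pagesize=5&site=stackoverflow&key=YOUR_KEY
```

Keeping the ID list in `path` (for example `questions/231767;419163`) leaves the semicolons unescaped, because only the query string goes through `urlencode`.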
50
+
51
+ ---
52
+
53
+ ## Response envelope
54
+
55
+ Every response from the Stack Exchange API is wrapped in a consistent envelope:
56
+
57
+ ```python
58
+ {
59
+     "items": [...],          # list of result objects
60
+     "has_more": True,        # bool: whether more pages exist
61
+     "quota_max": 300,        # total daily quota
62
+     "quota_remaining": 273,  # calls left today
63
+     "backoff": None          # seconds to wait before next call (rare)
64
+ }
65
+ ```
66
+
67
+ Always check `data.get('backoff')` — if it returns an integer, sleep that many seconds before the next call. Ignoring it causes throttle errors.
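The backoff and quota checks can live in one place so every call site handles the envelope the same way. The following is a sketch under stated assumptions: `check_envelope` and `QuotaExhausted` are hypothetical names, and the function works on an already-parsed envelope dict so it does not depend on how the request was made.

```python
import time

class QuotaExhausted(RuntimeError):
    """Raised when the envelope reports no remaining quota."""

def check_envelope(data, sleep=time.sleep, min_quota=1):
    """Honor `backoff`, enforce a quota floor, and return the items list."""
    backoff = data.get("backoff")
    if backoff:
        sleep(backoff)  # wait the number of seconds the server asked for
    remaining = data.get("quota_remaining")
    if remaining is not None and remaining < min_quota:
        raise QuotaExhausted(f"quota_remaining={remaining}")
    return data.get("items", [])

# Example with a recorded sleep function instead of a real delay.
slept = []
items = check_envelope(
    {"items": [{"question_id": 231767}], "backoff": 2, "quota_remaining": 10},
    sleep=slept.append,
)
print(items, slept)  # [{'question_id': 231767}] [2]
```

Injecting `sleep` keeps the helper testable; in normal use the default `time.sleep` applies.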
68
+
69
+ Error responses surface from `http_get` as `urllib.error.HTTPError`, not as a parsed JSON envelope:
70
+ - HTTP 400: invalid parameter (e.g. bad site name)
71
+ - HTTP 400 with `error_id: 502`: quota exhausted (`throttle_violation`); the server sends a JSON error body, but `http_get` raises before returning it
72
+
73
+ ```python
74
+ try:
75
+     data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?site=stackoverflow&pagesize=1"))
76
+ except Exception as e:
77
+     print("API error:", e) # HTTPError HTTP Error 400: Bad Request
78
+ ```
79
+
80
+ ---
81
+
82
+ ## `filter=withbody` — required for post content
83
+
84
+ By default, the API strips the `body` field from all responses. You **must** add `filter=withbody` to get question or answer text. This applies to questions, answers, and comments alike.
85
+
86
+ ```python
87
+ import json
88
+
89
+ # WITHOUT filter=withbody — body field is ABSENT
90
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1"))
91
+ q = data['items'][0]
92
+ print("Has body:", 'body' in q) # False
93
+ print("Keys:", sorted(q.keys()))
94
+ # ['accepted_answer_id', 'answer_count', 'content_license', 'creation_date',
95
+ # 'is_answered', 'last_activity_date', 'last_edit_date', 'link', 'owner',
96
+ # 'protected_date', 'question_id', 'score', 'tags', 'title', 'view_count']
97
+
98
+ # WITH filter=withbody — body field is PRESENT
99
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1&filter=withbody"))
100
+ q = data['items'][0]
101
+ print("Has body:", 'body' in q) # True
102
+ print("Body preview:", q['body'][:60])
103
+ # '<p>What functionality does the <a href="https://do...'
104
+ ```
105
+
106
+ ---
107
+
108
+ ## HTML encoding in API responses
109
+
110
+ The API returns HTML in two contexts, and plain text in a third:
111
+
112
+ - **`body` field** (questions, answers, comments) — full HTML markup. Headings, code blocks, links, blockquotes, lists. Strip with `html.parser` for plain text.
113
+ - **`title` field** — HTML-entity-encoded plain text. Quotes, angle brackets, and ampersands are escaped (`&quot;`, `&lt;`, `&amp;`). Decode with `html.unescape()`.
114
+ - **`display_name`, `link`, `tags`** — plain text, no encoding.
115
+
116
+ ```python
117
+ import json, html
118
+ from html.parser import HTMLParser
119
+
120
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions/231767?site=stackoverflow&filter=withbody"))
121
+ q = data['items'][0]
122
+
123
+ # Title has HTML entities
124
+ print("Raw title:", q['title'])
125
+ # 'What does the &quot;yield&quot; keyword do in Python?'
126
+ print("Decoded:", html.unescape(q['title']))
127
+ # 'What does the "yield" keyword do in Python?'
128
+
129
+ # Body is full HTML — strip for plain text
130
+ class Stripper(HTMLParser):
131
+     def __init__(self):
132
+         super().__init__()
133
+         self.text = []
134
+     def handle_data(self, d):
135
+         self.text.append(d)
136
+     def get_text(self):
137
+         return ''.join(self.text)
138
+
139
+ s = Stripper()
140
+ s.feed(q['body'])
141
+ print(s.get_text()[:200])
142
+ # 'What functionality does the yield keyword do in Python?\nWhat is the ...'
143
+ ```
144
+
145
+ ---
146
+
147
+ ## Common workflows
148
+
149
+ ### Top questions by tag (API)
150
+
151
+ ```python
152
+ import json, html
153
+ data = json.loads(http_get(
154
+ "https://api.stackexchange.com/2.3/questions"
155
+ "?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=5&filter=withbody"
156
+ ))
157
+ for q in data['items']:
158
+     print(q['question_id'], q['score'], html.unescape(q['title'])[:60])
159
+     print(" Tags:", q['tags'][:3], "Answers:", q['answer_count'])
160
+ print("Quota remaining:", data.get('quota_remaining'))
161
+ # 231767 13133 What does the "yield" keyword do in Python?
162
+ # Tags: ['python', 'iterator', 'generator'] Answers: 51
163
+ # 419163 8438 What does if __name__ == "__main__": do?
164
+ # Tags: ['python', 'namespaces', 'program-entry-point'] Answers: 40
165
+ # Quota remaining: 299
166
+ ```
167
+
168
+ Sort options for `/questions`: `activity`, `votes`, `creation`, `hot`, `week`, `month`.
169
+
170
+ ### Answers for a question
171
+
172
+ ```python
173
+ import json
174
+ data = json.loads(http_get(
175
+ "https://api.stackexchange.com/2.3/questions/231767/answers"
176
+ "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=3"
177
+ ))
178
+ for a in data['items']:
179
+     print(f"Score: {a['score']}, Accepted: {a.get('is_accepted')}")
180
+     print(f" Body preview: {a['body'][:150]}")
181
+ # Score: 18307, Accepted: True
182
+ # Body preview: <p>To understand what <a href="...">yield</a> does, ...
183
+ # Score: 2596, Accepted: False
184
+ # Score: 802, Accepted: False
185
+ ```
186
+
187
+ Answer fields (with `filter=withbody`): `answer_id`, `question_id`, `score`, `is_accepted`, `body`, `owner`, `creation_date`, `last_activity_date`, `content_license`.
188
+
189
+ ### Fetch questions by ID (batch)
190
+
191
+ Fetch up to 100 questions in one call using semicolons:
192
+
193
+ ```python
194
+ import json
195
+ data = json.loads(http_get(
196
+ "https://api.stackexchange.com/2.3/questions/231767;419163;394809"
197
+ "?site=stackoverflow&filter=withbody"
198
+ ))
199
+ print("Fetched:", len(data['items'])) # 3
200
+ for q in data['items']:
201
+     print(q['question_id'], q['score'], q['title'][:50])
202
+ # 231767 13133 What does the &quot;yield&quot; keyword do in Pyth
203
+ # 419163 8438 What does if __name__ == &quot;__main__&quot;: do?
204
+ # 394809 8125 Does Python have a ternary conditional operator?
205
+ ```
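When there are more than 100 IDs, the list has to be split into several batch calls. The sketch below shows one way to produce the semicolon-delimited path segments; `batch_id_paths` is a hypothetical helper, and the limit of 100 per call is the figure stated in this document.

```python
def batch_id_paths(ids, size=100):
    """Split IDs into semicolon-joined strings of at most `size` IDs each."""
    ids = list(ids)
    return [
        ";".join(str(i) for i in ids[start:start + size])
        for start in range(0, len(ids), size)
    ]

# 250 IDs need three calls: 100 + 100 + 50.
paths = batch_id_paths(range(1, 251))
print(len(paths))                 # 3
print(paths[2].split(";")[0])     # 201
print(batch_id_paths([231767, 419163, 394809]))
# ['231767;419163;394809']
```

Each returned string drops into the URL after `/questions/`, so 250 questions cost three quota units instead of 250.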
206
+
207
+ ### Search — `search/advanced` vs `search`
208
+
209
+ Use `/search/advanced` when you need combined keyword + tag filtering. Use `/search` when searching only by title keyword (`intitle=`).
210
+
211
+ ```python
212
+ import json
213
+
214
+ # search/advanced: keyword in body OR title, filtered by tag, sorted by relevance
215
+ data = json.loads(http_get(
216
+ "https://api.stackexchange.com/2.3/search/advanced"
217
+ "?q=asyncio+event+loop&tagged=python&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
218
+ ))
219
+ for q in data['items']:
220
+     print(q['score'], q['answer_count'], q['title'][:70])
221
+ # 137 3 "Asyncio Event Loop is Closed" when getting loop
222
+ # 47 3 Can an asyncio event loop run in the background without suspending the
223
+
224
+ # search: title-only keyword search via intitle=
225
+ data = json.loads(http_get(
226
+ "https://api.stackexchange.com/2.3/search"
227
+ "?intitle=asyncio+event+loop&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
228
+ ))
229
+ ```
230
+
231
+ `search/advanced` additional params: `accepted=True` (only questions with accepted answers), `answers=1` (minimum answer count), `body=` (keyword in body), `user=` (filter by owner user ID), `views=` (minimum view count), `fromdate=`/`todate=` (Unix timestamps).
232
+
233
+ ### User profile
234
+
235
+ ```python
236
+ import json
237
+
238
+ # Basic user info
239
+ user = json.loads(http_get("https://api.stackexchange.com/2.3/users/1?site=stackoverflow"))
240
+ u = user['items'][0]
241
+ print("User:", u['display_name'], "Rep:", u['reputation'], "Badges:", u['badge_counts'])
242
+ # User: Jeff Atwood Rep: 64159 Badges: {'bronze': 153, 'silver': 153, 'gold': 48}
243
+
244
+ # Fields: user_id, display_name, reputation, badge_counts, location, link,
245
+ # creation_date, last_access_date, is_employee, account_id,
246
+ # accept_rate, profile_image, website_url
247
+
248
+ # Timeline (badge, question, answer events)
249
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/timeline?site=stackoverflow&pagesize=5"))
250
+ print("Event types:", set(i['timeline_type'] for i in data['items']))
251
+ # {'badge'}
252
+
253
+ # User's top answers
254
+ answers = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/answers?site=stackoverflow&pagesize=5&order=desc&sort=votes"))
255
+ for a in answers['items']:
256
+     print("Score:", a['score'], "Question ID:", a.get('question_id'))
257
+
258
+ # User's questions
259
+ questions = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/questions?site=stackoverflow&pagesize=3&order=desc&sort=votes"))
260
+ for q in questions['items']:
261
+     print(q['question_id'], q['score'], q['title'][:60])
262
+ # 9 2273 How do I calculate someone&#39;s age based on a DateTime typ
263
+ # 11 1656 Calculate relative time in C#
264
+ ```
265
+
266
+ ### Comments (requires `filter=withbody`)
267
+
268
+ ```python
269
+ import json
270
+ data = json.loads(http_get(
271
+ "https://api.stackexchange.com/2.3/questions/231767/comments"
272
+ "?site=stackoverflow&pagesize=5&order=desc&sort=creation&filter=withbody"
273
+ ))
274
+ for c in data['items']:
275
+     print("Score:", c['score'], "Body:", c.get('body','')[:80])
276
+ # Comment keys (without filter): comment_id, content_license, creation_date,
277
+ # edited, owner, post_id, reply_to_user, score
278
+ # With filter=withbody: adds 'body' field (HTML-encoded)
279
+ ```
280
+
281
+ ### Related questions
282
+
283
+ ```python
284
+ import json
285
+ related = json.loads(http_get(
286
+ "https://api.stackexchange.com/2.3/questions/231767/related?site=stackoverflow&pagesize=5"
287
+ ))
288
+ for q in related['items']:
289
+     print(q['question_id'], q['score'], q['title'][:60])
290
+ # 25232350 15 how generators work in python
291
+ # 28880095 11 What does a plain yield keyword do in Python?
292
+ ```
293
+
294
+ ### Popular tags
295
+
296
+ ```python
297
+ import json
298
+ tags = json.loads(http_get("https://api.stackexchange.com/2.3/tags?order=desc&sort=popular&site=stackoverflow&pagesize=5"))
299
+ for t in tags['items']:
300
+     print(f"{t['name']}: {t['count']:,} questions")
301
+ # javascript: 2,531,995 questions
302
+ # java: 1,921,907 questions
303
+ # c#: 1,626,728 questions
304
+ # python: (check live — grows daily)
305
+ ```
306
+
307
+ ---
308
+
309
+ ## Pagination
310
+
311
+ Use `page=` (1-indexed) and `pagesize=` (max 100). Check `has_more` in the envelope to know whether a next page exists.
312
+
313
+ ```python
314
+ import json, time
315
+
316
+ def fetch_all_pages(url_base, max_pages=5):
317
+     """Fetch multiple pages from any Stack Exchange API endpoint."""
318
+     results = []
319
+     for page in range(1, max_pages + 1):
320
+         data = json.loads(http_get(f"{url_base}&page={page}"))
321
+         results.extend(data['items'])
322
+         if not data.get('has_more'):
323
+             break
324
+         if data.get('backoff'):
325
+             time.sleep(data['backoff'])  # honor the server's backoff before the next page
326
+     return results
327
+
328
+ questions = fetch_all_pages(
329
+ "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes"
330
+ "&tagged=python&site=stackoverflow&pagesize=10",
331
+ max_pages=3
332
+ )
333
+ print("Total fetched:", len(questions)) # up to 30
334
+ ```
335
+
336
+ Note: `page=2` with `pagesize=3` returns the 4th–6th items. Confirmed working — `has_more: True` on page 2 of top Python questions.
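The page arithmetic in the note generalizes: page `p` with page size `s` covers items `(p - 1) * s + 1` through `p * s`. A small sketch, with `page_bounds` as a hypothetical helper name and the 1 to 100 range for `pagesize` taken from this document:

```python
def page_bounds(page, pagesize):
    """Return the 1-based (first, last) item positions covered by a page."""
    if page < 1:
        raise ValueError("page is 1-indexed")
    if not 1 <= pagesize <= 100:
        raise ValueError("pagesize must be between 1 and 100")
    first = (page - 1) * pagesize + 1
    return first, first + pagesize - 1

print(page_bounds(2, 3))    # (4, 6)
print(page_bounds(1, 100))  # (1, 100)
```

The last page may hold fewer items than `pagesize`, so the upper bound is a maximum rather than a guarantee.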
337
+
338
+ ---
339
+
340
+ ## Parallel fetching (multiple questions or answers)
341
+
342
+ ```python
343
+ import json
344
+ from concurrent.futures import ThreadPoolExecutor
345
+
346
+ def fetch_top_answer(qid):
347
+     data = json.loads(http_get(
348
+         f"https://api.stackexchange.com/2.3/questions/{qid}/answers"
349
+         "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=1"
350
+     ))
351
+     if data['items']:
352
+         a = data['items'][0]
353
+         return {"qid": qid, "top_score": a['score'], "accepted": a.get('is_accepted')}
354
+     return {"qid": qid, "top_score": 0, "accepted": None}  # question has no answers
355
+
356
+ qids = [231767, 419163, 394809, 100003, 82831]
357
+ with ThreadPoolExecutor(max_workers=3) as ex:
358
+     results = list(ex.map(fetch_top_answer, qids))
359
+
360
+ for r in results:
361
+     print(r)
362
+ # {'qid': 231767, 'top_score': 18307, 'accepted': True}
363
+ # {'qid': 419163, 'top_score': 9051, 'accepted': True}
364
+ # {'qid': 394809, 'top_score': 9355, 'accepted': True}
365
+ # {'qid': 100003, 'top_score': 9334, 'accepted': False}
366
+ # {'qid': 82831, 'top_score': 6793, 'accepted': False}
367
+ ```
368
+
369
+ Keep `max_workers` at 3 or below when unauthenticated — parallel calls consume quota simultaneously. At 3 workers, 5 questions used 5 quota units (expected).
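If quota matters more than latency, an alternative to parallel calls is a single batched request to `/questions/{ids}/answers` followed by local grouping. The sketch below covers only the grouping step and rests on assumptions: that the endpoint accepts a semicolon-delimited ID list like the other batch routes, that each answer carries `question_id` and `score`, and that the sample rows (built from scores quoted in this document) stand in for a real response. `top_answer_per_question` is a hypothetical name.

```python
def top_answer_per_question(answers):
    """Map each question_id to its highest-scoring answer dict."""
    best = {}
    for a in answers:
        qid = a["question_id"]
        # Keep the answer only if it beats the current best for this question.
        if qid not in best or a["score"] > best[qid]["score"]:
            best[qid] = a
    return best

# Illustrative rows shaped like items from a batched answers response.
sample = [
    {"question_id": 231767, "score": 2596, "is_accepted": False},
    {"question_id": 231767, "score": 18307, "is_accepted": True},
    {"question_id": 419163, "score": 9051, "is_accepted": True},
]
best = top_answer_per_question(sample)
print(best[231767]["score"], best[419163]["score"])  # 18307 9051
```

Because `pagesize` caps at 100, a batch covering questions with many answers may still need several pages before every question's top answer has been seen.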
370
+
371
+ ---
372
+
373
+ ## HTML page scraping (avoid for data tasks)
374
+
375
+ The HTML page works but returns 777KB and has no clean `QAPage` JSON-LD. Use it only when you need something not in the API (e.g. rendered MathJax, ads context).
376
+
377
+ ```python
378
+ import json, re, html as htmllib
379
+ headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
380
+ page = http_get("https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python", headers=headers)
381
+ print("HTML length:", len(page)) # 777138
382
+
383
+ # Page title (includes site suffix)
384
+ title_m = re.search(r'<title>([^<]+)</title>', page)
385
+ if title_m:
386
+     print(htmllib.unescape(title_m.group(1)))
387
+ # 'iterator - What does the "yield" keyword do in Python? - Stack Overflow'
388
+
389
+ # Answer count via itemprop
390
+ ans_count = re.search(r'itemprop="answerCount"[^>]*>(\d+)<', page)
391
+ if ans_count:
392
+     print("Answers:", ans_count.group(1)) # '51'
393
+
394
+ # Score via itemprop (has whitespace around number)
395
+ score_m = re.search(r'itemprop="upvoteCount"[^>]*>\s*(-?\d+)\s*<', page)
396
+ if score_m:
397
+     print("Score:", score_m.group(1)) # '13133'
398
+
399
+ # JSON-LD is present but only has WebSite and Organization — NOT QAPage/Question
400
+ ld_match = re.search(r'<script type="application/ld\+json">(.*?)</script>', page, re.DOTALL)
401
+ if ld_match:
402
+     d = json.loads(ld_match.group(1))
403
+     types = [item.get('@type') for item in d.get('@graph', [])]
404
+     print("JSON-LD types:", types) # ['WebSite', 'Organization'], no QAPage
405
+ ```
406
+
407
+ ---
408
+
409
+ ## Gotchas
410
+
411
+ - **300 req/day unauthenticated is per IP, resets at midnight UTC.** 6 tests consumed ~27 quota units in one session. With parallel workers and loops, you can burn through 300 in minutes. Always check `quota_remaining` in responses.
412
+
413
+ - **`filter=withbody` is required for body content.** Without it, `body` is simply absent from the response — no error, no empty string, just a missing key. Applies to questions, answers, AND comments.
414
+
415
+ - **Title field has HTML entities, body field has full HTML markup.** They need different decoding strategies: `html.unescape()` for titles, `HTMLParser` stripping for bodies. Don't confuse them.
416
+
417
+ - **Titles in API responses contain `&quot;`, `&lt;`, `&amp;`, `&#39;`** — raw output is `What does the &quot;yield&quot; keyword do in Python?`. Always call `html.unescape()` before displaying or comparing.
418
+
419
+ - **Batch IDs with semicolons, not commas.** `/questions/231767;419163;394809` fetches 3 questions in one API call. Using commas returns a 400 error.
420
+
421
+ - **`search/advanced` matches against body text as well as titles; `/search` only searches titles.** Use `search/advanced` with `q=` for full-text search. Use `/search` with `intitle=` for title-only. Neither returns `body` in its results unless `filter=withbody` is added.
422
+
423
+ - **HTTP errors are raised as exceptions, not returned as JSON.** A bad `site=` param causes `urllib.error.HTTPError: HTTP Error 400: Bad Request` — there's no JSON body accessible from `http_get`. Wrap API calls in try/except.
424
+
425
+ - **`backoff` in the response envelope must be respected.** If `data.get('backoff')` returns an integer (rare, typically 10–30 seconds), sleep that duration before the next call. Ignoring it will cause throttle errors on subsequent requests.
426
+
427
+ - **`/info` endpoint wraps stats inside `items[0]`**, not directly in the envelope. Access as `data['items'][0]['total_questions']`.
428
+
429
+ - **JSON-LD on the HTML page is NOT QAPage schema.** The `<script type="application/ld+json">` block only contains `WebSite` and `Organization` objects in the `@graph` array. There is no `Question`, `Answer`, or `QAPage` type — confirmed on the most-voted Python question (231767). Don't rely on structured data from the HTML page.
430
+
431
+ - **User timeline `timeline_type` can be `badge`, `question`, `answer`, `comment`, `revision`, `suggested_edit`, `accepted`.** For very old/inactive users, all recent events may be `badge` only.
432
+
433
+ - **Multi-site support.** Change `site=stackoverflow` to any Stack Exchange site: `site=superuser`, `site=serverfault`, `site=askubuntu`, `site=unix`, `site=datascience`, `site=math`. Same API, same quota pool per IP.
434
+
435
+ - **`pagesize` max is 100.** Requesting more returns a 400 error. For bulk fetching, loop with `page=` and check `has_more`.
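The gotcha about HTTP errors applies to `http_get`; a caller that uses `urllib` directly can still read the error body from the exception via `HTTPError.read()`. The sketch below handles only the decoding step and makes assumptions that should be checked against a live response: that the error body is JSON with `error_id`, `error_name`, and `error_message`, and that it may arrive gzip-compressed like normal responses. `parse_error_body` is a hypothetical helper and the sample payload is illustrative.

```python
import gzip
import json

def parse_error_body(raw):
    """Decode an API error body (bytes), decompressing gzip when present."""
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))

# Illustrative payload standing in for bytes read from an HTTPError.
sample = gzip.compress(json.dumps({
    "error_id": 502,
    "error_name": "throttle_violation",
    "error_message": "too many requests",
}).encode("utf-8"))

err = parse_error_body(sample)
print(err["error_id"], err["error_name"])  # 502 throttle_violation
```

In a real handler this would be called as `parse_error_body(e.read())` inside `except urllib.error.HTTPError as e:`, with a fallback for bodies that are not JSON.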