@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,435 +1,435 @@
1
- # Stack Overflow — Scraping & Data Extraction
2
-
3
- `https://stackoverflow.com` — all public read-only data is available via the Stack Exchange API v2.3. No auth, no browser required for any read operation. API is fast, returns gzip-compressed JSON, and works transparently with `http_get`.
4
-
5
- ## Do this first: pick your access path
6
-
7
- | Goal | Best approach | Notes |
8
- |------|--------------|-------|
9
- | Top/hot questions by tag | `GET /2.3/questions` | Add `filter=withbody` for question text |
10
- | Answers for a question | `GET /2.3/questions/{id}/answers` | Add `filter=withbody` for answer text |
11
- | Search by keyword + tag | `GET /2.3/search/advanced` | More filters than `/search` |
12
- | Simple title keyword search | `GET /2.3/search` | `intitle=` param |
13
- | Fetch by known question IDs | `GET /2.3/questions/{id1};{id2};...` | Semicolon-delimited batch, up to 100 |
14
- | User profile + reputation | `GET /2.3/users/{id}` | Public fields only |
15
- | User activity timeline | `GET /2.3/users/{id}/timeline` | Events: badges, answers, questions |
16
- | User's questions / answers | `GET /2.3/users/{id}/questions` or `/answers` | Standard listing |
17
- | Comments on a post | `GET /2.3/questions/{id}/comments` | Needs `filter=withbody` for body |
18
- | Related questions | `GET /2.3/questions/{id}/related` | Returns linked/similar questions |
19
- | Answer by ID directly | `GET /2.3/answers/{id}` | One or more semicolon-separated IDs |
20
- | Popular tags | `GET /2.3/tags` | Sort by `popular`, `activity`, or `name` |
21
- | Site-wide statistics | `GET /2.3/info` | Total questions, quota, etc. |
22
- | Question HTML page | `http_get` with User-Agent | Returns 777KB HTML; prefer API |
23
-
24
- **Use the API for all data tasks.** The HTML page is 777KB, lacks clean structure, and the JSON-LD block only contains `WebSite` and `Organization` objects (no `QAPage` or `Question` schema). The API returns the same data in milliseconds, fully structured.
25
-
26
- ---
27
-
28
- ## Quota limits
29
-
30
- The API is unauthenticated-friendly but strictly quota-capped per IP per day:
31
-
32
- | Auth level | Daily quota | Burst |
33
- |------------|-------------|-------|
34
- | No key (unauthenticated) | **300 requests/day** | No enforced burst limit observed |
35
- | With API key | **10,000 requests/day** | Same |
36
-
37
- Check your remaining quota in every response envelope:
38
-
39
- ```python
40
- import json
41
- data = json.loads(http_get("https://api.stackexchange.com/2.3/info?site=stackoverflow"))
42
- print("Quota remaining:", data.get('quota_remaining')) # e.g. 273
43
- print("Quota max:", data.get('quota_max')) # 300 unauthenticated, 10000 with key
44
- # Confirmed: quota_max=300, quota_remaining decrements per call
45
- ```
46
-
47
- Every API response includes `quota_remaining` in the envelope. Monitor it. When it hits 0, all calls return HTTP 400 with `error_id: 502` (throttle_violation). There is no retry-after header — wait until midnight UTC.
48
-
49
- **If you have an API key**, append `&key=YOUR_KEY` to any URL to use the 10,000/day quota.
50
-
51
- ---
52
-
53
- ## Response envelope
54
-
55
- Every response from the Stack Exchange API is wrapped in a consistent envelope:
56
-
57
- ```python
58
- {
59
-     "items": [...], # list of result objects
60
-     "has_more": True/False, # whether more pages exist
61
-     "quota_max": 300, # total daily quota
62
-     "quota_remaining": 273, # calls left today
63
-     "backoff": None # seconds to wait before next call (rare)
64
- }
65
- ```
66
-
67
- Always check `data.get('backoff')` — if it returns an integer, sleep that many seconds before the next call. Ignoring it causes throttle errors.
68
-
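The `backoff` and `quota_remaining` checks described above can be folded into one small pacing helper. This is a minimal sketch, not part of the diffed file: `seconds_to_wait`, `quota_exhausted`, and `pace` are hypothetical names, and the envelope is assumed to be the already-parsed JSON dict.

```python
import time


def seconds_to_wait(envelope):
    """Return the pause requested by the envelope's `backoff` field (0 if absent)."""
    backoff = envelope.get("backoff")
    return int(backoff) if backoff else 0


def quota_exhausted(envelope):
    """True when the envelope reports no remaining daily quota."""
    # Assumption: a missing quota_remaining is treated as "quota still available".
    return envelope.get("quota_remaining", 1) <= 0


def pace(envelope, sleep=time.sleep):
    """Honor any requested backoff, then report whether further calls are allowed."""
    delay = seconds_to_wait(envelope)
    if delay:
        sleep(delay)
    return not quota_exhausted(envelope)
```

Calling `pace(data)` after each request and stopping when it returns `False` would keep a loop inside both limits; the `sleep` parameter exists only so the pause can be replaced in tests.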
69
- Error responses raise `urllib.error.HTTPError` (not a JSON envelope):
70
- - HTTP 400 — invalid parameter (e.g. bad site name) — raises exception
71
- - HTTP 400 with JSON body — quota exhausted or throttle_violation
72
-
73
- ```python
74
- try:
75
-     data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?site=stackoverflow&pagesize=1"))
76
- except Exception as e:
77
-     print("API error:", e) # HTTPError HTTP Error 400: Bad Request
78
- ```
79
-
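When the 400 response does carry a JSON body, decoding it distinguishes a throttle from a bad parameter. The sketch below is an illustration under assumptions: `parse_error_body` and `is_throttled` are hypothetical helpers, the body is assumed to arrive as raw bytes that may or may not be gzip-compressed, and the `error_id` value 502 is the throttle code quoted earlier in this file.

```python
import gzip
import json


def parse_error_body(raw):
    """Decode an error response body (bytes) into a dict, or None if it is not JSON."""
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    try:
        payload = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers both UnicodeDecodeError and JSONDecodeError
        return None
    return payload if isinstance(payload, dict) else None


def is_throttled(payload):
    """True when the decoded error payload reports the throttle error id."""
    return bool(payload) and payload.get("error_id") == 502
```

If `http_get` lets `urllib.error.HTTPError` propagate unchanged, `e.read()` should supply the bytes to pass in; that behavior of `http_get` is an assumption, not something this diff confirms.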
80
- ---
81
-
82
- ## `filter=withbody` — required for post content
83
-
84
- By default, the API strips the `body` field from all responses. You **must** add `filter=withbody` to get question or answer text. This applies to questions, answers, and comments alike.
85
-
86
- ```python
87
- import json
88
-
89
- # WITHOUT filter=withbody — body field is ABSENT
90
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1"))
91
- q = data['items'][0]
92
- print("Has body:", 'body' in q) # False
93
- print("Keys:", sorted(q.keys()))
94
- # ['accepted_answer_id', 'answer_count', 'content_license', 'creation_date',
95
- # 'is_answered', 'last_activity_date', 'last_edit_date', 'link', 'owner',
96
- # 'protected_date', 'question_id', 'score', 'tags', 'title', 'view_count']
97
-
98
- # WITH filter=withbody — body field is PRESENT
99
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1&filter=withbody"))
100
- q = data['items'][0]
101
- print("Has body:", 'body' in q) # True
102
- print("Body preview:", q['body'][:60])
103
- # '<p>What functionality does the <a href="https://do...'
104
- ```
105
-
106
- ---
107
-
108
- ## HTML encoding in API responses
109
-
110
- The API returns HTML in two contexts, and plain text in a third:
111
-
112
- - **`body` field** (questions, answers, comments) — full HTML markup. Headings, code blocks, links, blockquotes, lists. Strip with `html.parser` for plain text.
113
- - **`title` field** — HTML-entity-encoded plain text. Quotes, angle brackets, and ampersands are escaped (`&quot;`, `&lt;`, `&amp;`). Decode with `html.unescape()`.
114
- - **`display_name`, `link`, `tags`** — plain text, no encoding.
115
-
116
- ```python
117
- import json, html
118
- from html.parser import HTMLParser
119
-
120
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions/231767?site=stackoverflow&filter=withbody"))
121
- q = data['items'][0]
122
-
123
- # Title has HTML entities
124
- print("Raw title:", q['title'])
125
- # 'What does the &quot;yield&quot; keyword do in Python?'
126
- print("Decoded:", html.unescape(q['title']))
127
- # 'What does the "yield" keyword do in Python?'
128
-
129
- # Body is full HTML — strip for plain text
130
- class Stripper(HTMLParser):
131
-     def __init__(self):
132
-         super().__init__()
133
-         self.text = []
134
-     def handle_data(self, d):
135
-         self.text.append(d)
136
-     def get_text(self):
137
-         return ''.join(self.text)
138
-
139
- s = Stripper()
140
- s.feed(q['body'])
141
- print(s.get_text()[:200])
142
- # 'What functionality does the yield keyword do in Python?\nWhat is the ...'
143
- ```
144
-
145
- ---
146
-
147
- ## Common workflows
148
-
149
- ### Top questions by tag (API)
150
-
151
- ```python
152
- import json, html
153
- data = json.loads(http_get(
154
-     "https://api.stackexchange.com/2.3/questions"
155
-     "?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=5&filter=withbody"
156
- ))
157
- for q in data['items']:
158
-     print(q['question_id'], q['score'], html.unescape(q['title'])[:60])
159
-     print(" Tags:", q['tags'][:3], "Answers:", q['answer_count'])
160
- print("Quota remaining:", data.get('quota_remaining'))
161
- # 231767 13133 What does the "yield" keyword do in Python?
162
- # Tags: ['python', 'iterator', 'generator'] Answers: 51
163
- # 419163 8438 What does if __name__ == "__main__": do?
164
- # Tags: ['python', 'namespaces', 'program-entry-point'] Answers: 40
165
- # Quota remaining: 299
166
- ```
167
-
168
- Sort options for `/questions`: `activity`, `votes`, `creation`, `hot`, `week`, `month`.
169
-
170
- ### Answers for a question
171
-
172
- ```python
173
- import json
174
- data = json.loads(http_get(
175
-     "https://api.stackexchange.com/2.3/questions/231767/answers"
176
-     "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=3"
177
- ))
178
- for a in data['items']:
179
-     print(f"Score: {a['score']}, Accepted: {a.get('is_accepted')}")
180
-     print(f" Body preview: {a['body'][:150]}")
181
- # Score: 18307, Accepted: True
182
- # Body preview: <p>To understand what <a href="...">yield</a> does, ...
183
- # Score: 2596, Accepted: False
184
- # Score: 802, Accepted: False
185
- ```
186
-
187
- Answer fields (with `filter=withbody`): `answer_id`, `question_id`, `score`, `is_accepted`, `body`, `owner`, `creation_date`, `last_activity_date`, `content_license`.
188
-
189
- ### Fetch questions by ID (batch)
190
-
191
- Fetch up to 100 questions in one call using semicolons:
192
-
193
- ```python
194
- import json
195
- data = json.loads(http_get(
196
-     "https://api.stackexchange.com/2.3/questions/231767;419163;394809"
197
-     "?site=stackoverflow&filter=withbody"
198
- ))
199
- print("Fetched:", len(data['items'])) # 3
200
- for q in data['items']:
201
-     print(q['question_id'], q['score'], q['title'][:50])
202
- # 231767 13133 What does the &quot;yield&quot; keyword do in Pyth
203
- # 419163 8438 What does if __name__ == &quot;__main__&quot;: do?
204
- # 394809 8125 Does Python have a ternary conditional operator?
205
- ```
206
-
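For ID lists longer than the 100-per-call limit noted above, the IDs can be split into chunks before building each URL. This is a sketch only: `batch_question_urls` is a hypothetical helper, and setting `pagesize` to the chunk length is an assumption made so that a full batch is not cut short by the API's default page size.

```python
def batch_question_urls(question_ids, site="stackoverflow", batch_size=100):
    """Build one /questions/{ids} URL per chunk of at most `batch_size` IDs."""
    ids = list(dict.fromkeys(question_ids))  # drop duplicates, keep first-seen order
    urls = []
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        joined = ";".join(str(i) for i in chunk)  # semicolon-delimited batch
        urls.append(
            f"https://api.stackexchange.com/2.3/questions/{joined}"
            f"?site={site}&filter=withbody&pagesize={len(chunk)}"
        )
    return urls
```

Each returned URL costs one quota unit, so 250 IDs would cost three calls rather than 250.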
207
- ### Search — `search/advanced` vs `search`
208
-
209
- Use `/search/advanced` when you need combined keyword + tag filtering. Use `/search` when searching only by title keyword (`intitle=`).
210
-
211
- ```python
212
- import json
213
-
214
- # search/advanced: keyword in body OR title, filtered by tag, sorted by relevance
215
- data = json.loads(http_get(
216
-     "https://api.stackexchange.com/2.3/search/advanced"
217
-     "?q=asyncio+event+loop&tagged=python&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
218
- ))
219
- for q in data['items']:
220
-     print(q['score'], q['answer_count'], q['title'][:70])
221
- # 137 3 "Asyncio Event Loop is Closed" when getting loop
222
- # 47 3 Can an asyncio event loop run in the background without suspending the
223
-
224
- # search: title-only keyword search via intitle=
225
- data = json.loads(http_get(
226
-     "https://api.stackexchange.com/2.3/search"
227
-     "?intitle=asyncio+event+loop&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
228
- ))
229
- ```
230
-
231
- `search/advanced` additional params: `accepted=True` (only questions with accepted answers), `answers=1` (minimum answer count), `body=` (keyword in body), `user=` (filter by owner user ID), `views=` (minimum view count), `fromdate=`/`todate=` (Unix timestamps).
232
-
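The optional `search/advanced` parameters listed above are easier to combine with a URL builder than with hand-written query strings. A minimal sketch follows, assuming the standard-library `urllib.parse.urlencode` for escaping; `advanced_search_url` and its keyword names (`min_answers` in particular) are hypothetical and map onto the API parameters named in the text.

```python
from urllib.parse import urlencode


def advanced_search_url(q, tagged=None, accepted=None, min_answers=None,
                        fromdate=None, todate=None, key=None,
                        site="stackoverflow", pagesize=5):
    """Build a /search/advanced URL, including only the optional params that are set."""
    params = {"q": q, "site": site, "pagesize": pagesize,
              "order": "desc", "sort": "relevance"}
    optional = {
        "tagged": tagged,
        "accepted": accepted,      # e.g. True: only questions with an accepted answer
        "answers": min_answers,    # minimum answer count
        "fromdate": fromdate,      # Unix timestamp
        "todate": todate,          # Unix timestamp
        "key": key,                # API key, if one is available
    }
    params.update({k: v for k, v in optional.items() if v is not None})
    return "https://api.stackexchange.com/2.3/search/advanced?" + urlencode(params)
```

For example, `advanced_search_url("asyncio event loop", tagged="python", accepted=True)` yields a URL that can be passed straight to `http_get`.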
233
- ### User profile
234
-
235
- ```python
236
- import json
237
-
238
- # Basic user info
239
- user = json.loads(http_get("https://api.stackexchange.com/2.3/users/1?site=stackoverflow"))
240
- u = user['items'][0]
241
- print("User:", u['display_name'], "Rep:", u['reputation'], "Badges:", u['badge_counts'])
242
- # User: Jeff Atwood Rep: 64159 Badges: {'bronze': 153, 'silver': 153, 'gold': 48}
243
-
244
- # Fields: user_id, display_name, reputation, badge_counts, location, link,
245
- # creation_date, last_access_date, is_employee, account_id,
246
- # accept_rate, profile_image, website_url
247
-
248
- # Timeline (badge, question, answer events)
249
- data = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/timeline?site=stackoverflow&pagesize=5"))
250
- print("Event types:", set(i['timeline_type'] for i in data['items']))
251
- # {'badge'}
252
-
253
- # User's top answers
254
- answers = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/answers?site=stackoverflow&pagesize=5&order=desc&sort=votes"))
255
- for a in answers['items']:
256
-     print("Score:", a['score'], "Question ID:", a.get('question_id'))
257
-
258
- # User's questions
259
- questions = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/questions?site=stackoverflow&pagesize=3&order=desc&sort=votes"))
260
- for q in questions['items']:
261
-     print(q['question_id'], q['score'], q['title'][:60])
262
- # 9 2273 How do I calculate someone&#39;s age based on a DateTime typ
263
- # 11 1656 Calculate relative time in C#
264
- ```
265
-
266
- ### Comments (requires `filter=withbody`)
267
-
268
- ```python
269
- import json
270
- data = json.loads(http_get(
271
-     "https://api.stackexchange.com/2.3/questions/231767/comments"
272
-     "?site=stackoverflow&pagesize=5&order=desc&sort=creation&filter=withbody"
273
- ))
274
- for c in data['items']:
275
-     print("Score:", c['score'], "Body:", c.get('body','')[:80])
276
- # Comment keys (without filter): comment_id, content_license, creation_date,
277
- # edited, owner, post_id, reply_to_user, score
278
- # With filter=withbody: adds 'body' field (HTML-encoded)
279
- ```
280
-
281
- ### Related questions
282
-
283
- ```python
284
- import json
285
- related = json.loads(http_get(
286
-     "https://api.stackexchange.com/2.3/questions/231767/related?site=stackoverflow&pagesize=5"
287
- ))
288
- for q in related['items']:
289
-     print(q['question_id'], q['score'], q['title'][:60])
290
- # 25232350 15 how generators work in python
291
- # 28880095 11 What does a plain yield keyword do in Python?
292
- ```
293
-
294
- ### Popular tags
295
-
296
- ```python
297
- import json
298
- tags = json.loads(http_get("https://api.stackexchange.com/2.3/tags?order=desc&sort=popular&site=stackoverflow&pagesize=5"))
299
- for t in tags['items']:
300
-     print(f"{t['name']}: {t['count']:,} questions")
301
- # javascript: 2,531,995 questions
302
- # java: 1,921,907 questions
303
- # c#: 1,626,728 questions
304
- # python: (check live — grows daily)
305
- ```
306
-
307
- ---
308
-
309
- ## Pagination
310
-
311
- Use `page=` (1-indexed) and `pagesize=` (max 100). Check `has_more` in the envelope to know whether a next page exists.
312
-
313
- ```python
314
- import json
315
-
316
- def fetch_all_pages(url_base, max_pages=5):
317
-     """Fetch multiple pages from any Stack Exchange API endpoint."""
318
-     results = []
319
-     for page in range(1, max_pages + 1):
320
-         data = json.loads(http_get(f"{url_base}&page={page}"))
321
-         results.extend(data['items'])
322
-         if not data.get('has_more'):
323
-             break
324
-         if data.get('backoff'):
325
-             import time; time.sleep(data['backoff'])
326
-     return results
327
-
328
- questions = fetch_all_pages(
329
-     "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes"
330
-     "&tagged=python&site=stackoverflow&pagesize=10",
331
-     max_pages=3
332
- )
333
- print("Total fetched:", len(questions)) # up to 30
334
- ```
335
-
336
- Note: `page=2` with `pagesize=3` returns the 4th–6th items. Confirmed working — `has_more: True` on page 2 of top Python questions.
337
-
338
- ---
339
-
340
- ## Parallel fetching (multiple questions or answers)
341
-
342
- ```python
343
- import json
344
- from concurrent.futures import ThreadPoolExecutor
345
-
346
- def fetch_top_answer(qid):
347
-     data = json.loads(http_get(
348
-         f"https://api.stackexchange.com/2.3/questions/{qid}/answers"
349
-         "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=1"
350
-     ))
351
-     if data['items']:
352
-         a = data['items'][0]
353
-         return {"qid": qid, "top_score": a['score'], "accepted": a.get('is_accepted')}
354
-     return {"qid": qid, "top_score": 0}
355
-
356
- qids = [231767, 419163, 394809, 100003, 82831]
357
- with ThreadPoolExecutor(max_workers=3) as ex:
358
-     results = list(ex.map(fetch_top_answer, qids))
359
-
360
- for r in results:
361
-     print(r)
362
- # {'qid': 231767, 'top_score': 18307, 'accepted': True}
363
- # {'qid': 419163, 'top_score': 9051, 'accepted': True}
364
- # {'qid': 394809, 'top_score': 9355, 'accepted': True}
365
- # {'qid': 100003, 'top_score': 9334, 'accepted': False}
366
- # {'qid': 82831, 'top_score': 6793, 'accepted': False}
367
- ```
368
-
369
- Keep `max_workers` at 3 or below when unauthenticated — parallel calls consume quota simultaneously. At 3 workers, 5 questions used 5 quota units (expected).
370
-
371
- ---
372
-
373
- ## HTML page scraping (avoid for data tasks)
374
-
375
- The HTML page works but returns 777KB and has no clean `QAPage` JSON-LD. Use it only when you need something not in the API (e.g. rendered MathJax, ads context).
376
-
377
- ```python
378
- import json, re, html as htmllib
379
- headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
380
- page = http_get("https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python", headers=headers)
381
- print("HTML length:", len(page)) # 777138
382
-
383
- # Page title (includes site suffix)
384
- title_m = re.search(r'<title>([^<]+)</title>', page)
385
- if title_m:
386
-     print(htmllib.unescape(title_m.group(1)))
387
- # 'iterator - What does the "yield" keyword do in Python? - Stack Overflow'
388
-
389
- # Answer count via itemprop
390
- ans_count = re.search(r'itemprop="answerCount"[^>]*>(\d+)<', page)
391
- if ans_count:
392
-     print("Answers:", ans_count.group(1)) # '51'
393
-
394
- # Score via itemprop (has whitespace around number)
395
- score_m = re.search(r'itemprop="upvoteCount"[^>]*>\s*(-?\d+)\s*<', page)
396
- if score_m:
397
-     print("Score:", score_m.group(1)) # '13133'
398
-
399
- # JSON-LD is present but only has WebSite and Organization — NOT QAPage/Question
400
- ld_match = re.search(r'<script type="application/ld\+json">(.*?)</script>', page, re.DOTALL)
401
- if ld_match:
402
-     d = json.loads(ld_match.group(1))
403
-     types = [item.get('@type') for item in d.get('@graph', [])]
404
-     print("JSON-LD types:", types) # ['WebSite', 'Organization'] (no QAPage)
405
- ```
406
-
407
- ---
408
-
409
- ## Gotchas
410
-
411
- - **300 req/day unauthenticated is per IP, resets at midnight UTC.** 6 tests consumed ~27 quota units in one session. With parallel workers and loops, you can burn through 300 in minutes. Always check `quota_remaining` in responses.
412
-
413
- - **`filter=withbody` is required for body content.** Without it, `body` is simply absent from the response — no error, no empty string, just a missing key. Applies to questions, answers, AND comments.
414
-
415
- - **Title field has HTML entities, body field has full HTML markup.** They need different decoding strategies: `html.unescape()` for titles, `HTMLParser` stripping for bodies. Don't confuse them.
416
-
417
- - **Titles in API responses contain `&quot;`, `&lt;`, `&amp;`, `&#39;`** — raw output is `What does the &quot;yield&quot; keyword do in Python?`. Always call `html.unescape()` before displaying or comparing.
418
-
419
- - **Batch IDs with semicolons, not commas.** `/questions/231767;419163;394809` fetches 3 questions in one API call. Using commas returns a 400 error.
420
-
421
- **`search/advanced` searches body text as well as titles; `/search` only searches titles.** Use `search/advanced` with `q=` for full-text search. Use `/search` with `intitle=` for title-only.
422
-
423
- - **HTTP errors are raised as exceptions, not returned as JSON.** A bad `site=` param causes `urllib.error.HTTPError: HTTP Error 400: Bad Request` — there's no JSON body accessible from `http_get`. Wrap API calls in try/except.
424
-
425
- - **`backoff` in the response envelope must be respected.** If `data.get('backoff')` returns an integer (rare, typically 10–30 seconds), sleep that duration before the next call. Ignoring it will cause throttle errors on subsequent requests.
426
-
427
- - **`/info` endpoint wraps stats inside `items[0]`**, not directly in the envelope. Access as `data['items'][0]['total_questions']`.
428
-
429
- - **JSON-LD on the HTML page is NOT QAPage schema.** The `<script type="application/ld+json">` block only contains `WebSite` and `Organization` objects in the `@graph` array. There is no `Question`, `Answer`, or `QAPage` type — confirmed on the most-voted Python question (231767). Don't rely on structured data from the HTML page.
430
-
431
- - **User timeline `timeline_type` can be `badge`, `question`, `answer`, `comment`, `revision`, `suggested_edit`, `accepted`.** For very old/inactive users, all recent events may be `badge` only.
432
-
433
- - **Multi-site support.** Change `site=stackoverflow` to any Stack Exchange site: `site=superuser`, `site=serverfault`, `site=askubuntu`, `site=unix`, `site=datascience`, `site=math`. Same API, same quota pool per IP.
434
-
435
- - **`pagesize` max is 100.** Requesting more returns a 400 error. For bulk fetching, loop with `page=` and check `has_more`.
1
+ # Stack Overflow — Scraping & Data Extraction
2
+
3
+ `https://stackoverflow.com` — all public read-only data is available via the Stack Exchange API v2.3. No auth, no browser required for any read operation. API is fast, returns gzip-compressed JSON, and works transparently with `http_get`.
4
+
5
+ ## Do this first: pick your access path
6
+
7
+ | Goal | Best approach | Notes |
8
+ |------|--------------|-------|
9
+ | Top/hot questions by tag | `GET /2.3/questions` | Add `filter=withbody` for question text |
10
+ | Answers for a question | `GET /2.3/questions/{id}/answers` | Add `filter=withbody` for answer text |
11
+ | Search by keyword + tag | `GET /2.3/search/advanced` | More filters than `/search` |
12
+ | Simple title keyword search | `GET /2.3/search` | `intitle=` param |
13
+ | Fetch by known question IDs | `GET /2.3/questions/{id1};{id2};...` | Semicolon-delimited batch, up to 100 |
14
+ | User profile + reputation | `GET /2.3/users/{id}` | Public fields only |
15
+ | User activity timeline | `GET /2.3/users/{id}/timeline` | Events: badges, answers, questions |
16
+ | User's questions / answers | `GET /2.3/users/{id}/questions` or `/answers` | Standard listing |
17
+ | Comments on a post | `GET /2.3/questions/{id}/comments` | Needs `filter=withbody` for body |
18
+ | Related questions | `GET /2.3/questions/{id}/related` | Returns linked/similar questions |
19
+ | Answer by ID directly | `GET /2.3/answers/{id}` | One or more semicolon-separated IDs |
20
+ | Popular tags | `GET /2.3/tags` | Sort by `popular`, `activity`, or `name` |
21
+ | Site-wide statistics | `GET /2.3/info` | Total questions, quota, etc. |
22
+ | Question HTML page | `http_get` with User-Agent | Returns 777KB HTML; prefer API |
23
+
24
+ **Use the API for all data tasks.** The HTML page is 777KB, lacks clean structure, and the JSON-LD block only contains `WebSite` and `Organization` objects (no `QAPage` or `Question` schema). The API returns the same data in milliseconds, fully structured.
25
+
26
+ ---
27
+
28
+ ## Quota limits
29
+
30
+ The API is unauthenticated-friendly but strictly quota-capped per IP per day:
31
+
32
+ | Auth level | Daily quota | Burst |
33
+ |------------|-------------|-------|
34
+ | No key (unauthenticated) | **300 requests/day** | No enforced burst limit observed |
35
+ | With API key | **10,000 requests/day** | Same |
36
+
37
+ Check your remaining quota in every response envelope:
38
+
39
+ ```python
40
+ import json
41
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/info?site=stackoverflow"))
42
+ print("Quota remaining:", data.get('quota_remaining')) # e.g. 273
43
+ print("Quota max:", data.get('quota_max')) # 300 unauthenticated, 10000 with key
44
+ # Confirmed: quota_max=300, quota_remaining decrements per call
45
+ ```
46
+
47
+ Every API response includes `quota_remaining` in the envelope. Monitor it. When it hits 0, all calls return HTTP 400 with `error_id: 502` (throttle_violation). There is no retry-after header — wait until midnight UTC.
48
+
49
+ **If you have an API key**, append `&key=YOUR_KEY` to any URL to use the 10,000/day quota.
50
+
51
+ ---
52
+
53
+ ## Response envelope
54
+
55
+ Every response from the Stack Exchange API is wrapped in a consistent envelope:
56
+
57
+ ```python
58
+ {
59
+ "items": [...], # list of result objects
60
+ "has_more": True/False, # whether more pages exist
61
+ "quota_max": 300, # total daily quota
62
+ "quota_remaining": 273, # calls left today
63
+ "backoff": None # seconds to wait before next call (rare)
64
+ }
65
+ ```
66
+
67
+ Always check `data.get('backoff')` — if it returns an integer, sleep that many seconds before the next call. Ignoring it causes throttle errors.
68
+
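The backoff rule above can be sketched as a small helper. This is a minimal illustration, assuming the envelope has already been parsed into a dict; `wait_for_backoff` and its injectable `sleep` parameter are hypothetical names and not part of the package.

```python
import time


def wait_for_backoff(envelope, sleep=time.sleep):
    """Sleep for envelope['backoff'] seconds when the API asks for it.

    Returns the number of seconds waited (0 when no backoff was requested).
    `sleep` is injectable so the helper can be exercised without blocking.
    """
    seconds = envelope.get("backoff") or 0
    if seconds > 0:
        sleep(seconds)
    return seconds


# Demonstration with a fake sleep that only records the requested delay
waited = []
print(wait_for_backoff({"items": [], "backoff": 10}, sleep=waited.append))    # 10
print(wait_for_backoff({"items": [], "backoff": None}, sleep=waited.append))  # 0
print(waited)  # [10]
```

Calling it right after each `json.loads(http_get(...))` keeps the wait in one place instead of repeating the check in every loop.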
69
+ Error responses raise `urllib.error.HTTPError` (not a JSON envelope):
70
+ - HTTP 400 — invalid parameter (e.g. bad site name) — raises exception
71
+ - HTTP 400 with JSON body — quota exhausted or throttle_violation
72
+
73
+ ```python
74
+ try:
75
+     data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?site=stackoverflow&pagesize=1"))
76
+ except Exception as e:
77
+     print("API error:", e) # HTTPError HTTP Error 400: Bad Request
78
+ ```
79
+
80
+ ---
81
+
82
+ ## `filter=withbody` — required for post content
83
+
84
+ By default, the API strips the `body` field from all responses. You **must** add `filter=withbody` to get question or answer text. This applies to questions, answers, and comments alike.
85
+
86
+ ```python
87
+ import json
88
+
89
+ # WITHOUT filter=withbody — body field is ABSENT
90
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1"))
91
+ q = data['items'][0]
92
+ print("Has body:", 'body' in q) # False
93
+ print("Keys:", sorted(q.keys()))
94
+ # ['accepted_answer_id', 'answer_count', 'content_license', 'creation_date',
95
+ # 'is_answered', 'last_activity_date', 'last_edit_date', 'link', 'owner',
96
+ # 'protected_date', 'question_id', 'score', 'tags', 'title', 'view_count']
97
+
98
+ # WITH filter=withbody — body field is PRESENT
99
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1&filter=withbody"))
100
+ q = data['items'][0]
101
+ print("Has body:", 'body' in q) # True
102
+ print("Body preview:", q['body'][:60])
103
+ # '<p>What functionality does the <a href="https://do...'
104
+ ```
105
+
106
+ ---
107
+
108
+ ## HTML encoding in API responses
109
+
110
+ The API returns HTML in two contexts, and plain text in a third:
111
+
112
+ - **`body` field** (questions, answers, comments) — full HTML markup. Headings, code blocks, links, blockquotes, lists. Strip with `html.parser` for plain text.
113
+ - **`title` field** — HTML-entity-encoded plain text. Quotes, angle brackets, and ampersands are escaped (`&quot;`, `&lt;`, `&amp;`). Decode with `html.unescape()`.
114
+ - **`display_name`, `link`, `tags`** — plain text, no encoding.
115
+
116
+ ```python
117
+ import json, html
118
+ from html.parser import HTMLParser
119
+
120
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions/231767?site=stackoverflow&filter=withbody"))
121
+ q = data['items'][0]
122
+
123
+ # Title has HTML entities
124
+ print("Raw title:", q['title'])
125
+ # 'What does the &quot;yield&quot; keyword do in Python?'
126
+ print("Decoded:", html.unescape(q['title']))
127
+ # 'What does the "yield" keyword do in Python?'
128
+
129
+ # Body is full HTML — strip for plain text
130
+ class Stripper(HTMLParser):
131
+     def __init__(self):
132
+         super().__init__()
133
+         self.text = []
134
+     def handle_data(self, d):
135
+         self.text.append(d)
136
+     def get_text(self):
137
+         return ''.join(self.text)
138
+
139
+ s = Stripper()
140
+ s.feed(q['body'])
141
+ print(s.get_text()[:200])
142
+ # 'What functionality does the yield keyword do in Python?\nWhat is the ...'
143
+ ```
144
+
145
+ ---
146
+
147
+ ## Common workflows
148
+
149
+ ### Top questions by tag (API)
150
+
151
+ ```python
152
+ import json, html
153
+ data = json.loads(http_get(
154
+ "https://api.stackexchange.com/2.3/questions"
155
+ "?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=5&filter=withbody"
156
+ ))
157
+ for q in data['items']:
158
+     print(q['question_id'], q['score'], html.unescape(q['title'])[:60])
159
+     print(" Tags:", q['tags'][:3], "Answers:", q['answer_count'])
160
+ print("Quota remaining:", data.get('quota_remaining'))
161
+ # 231767 13133 What does the "yield" keyword do in Python?
162
+ # Tags: ['python', 'iterator', 'generator'] Answers: 51
163
+ # 419163 8438 What does if __name__ == "__main__": do?
164
+ # Tags: ['python', 'namespaces', 'program-entry-point'] Answers: 40
165
+ # Quota remaining: 299
166
+ ```
167
+
168
+ Sort options for `/questions`: `activity`, `votes`, `creation`, `hot`, `week`, `month`.
169
+
170
+ ### Answers for a question
171
+
172
+ ```python
173
+ import json
174
+ data = json.loads(http_get(
175
+ "https://api.stackexchange.com/2.3/questions/231767/answers"
176
+ "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=3"
177
+ ))
178
+ for a in data['items']:
179
+     print(f"Score: {a['score']}, Accepted: {a.get('is_accepted')}")
180
+     print(f" Body preview: {a['body'][:150]}")
181
+ # Score: 18307, Accepted: True
182
+ # Body preview: <p>To understand what <a href="...">yield</a> does, ...
183
+ # Score: 2596, Accepted: False
184
+ # Score: 802, Accepted: False
185
+ ```
186
+
187
+ Answer fields (with `filter=withbody`): `answer_id`, `question_id`, `score`, `is_accepted`, `body`, `owner`, `creation_date`, `last_activity_date`, `content_license`.
188
+
189
+ ### Fetch questions by ID (batch)
190
+
191
+ Fetch up to 100 questions in one call using semicolons:
192
+
193
+ ```python
194
+ import json
195
+ data = json.loads(http_get(
196
+ "https://api.stackexchange.com/2.3/questions/231767;419163;394809"
197
+ "?site=stackoverflow&filter=withbody"
198
+ ))
199
+ print("Fetched:", len(data['items'])) # 3
200
+ for q in data['items']:
201
+     print(q['question_id'], q['score'], q['title'][:50])
202
+ # 231767 13133 What does the &quot;yield&quot; keyword do in Pyth
203
+ # 419163 8438 What does if __name__ == &quot;__main__&quot;: do?
204
+ # 394809 8125 Does Python have a ternary conditional operator?
205
+ ```
206
+
207
+ ### Search — `search/advanced` vs `search`
208
+
209
+ Use `/search/advanced` when you need combined keyword + tag filtering. Use `/search` when searching only by title keyword (`intitle=`).
210
+
211
+ ```python
212
+ import json
213
+
214
+ # search/advanced: keyword in body OR title, filtered by tag, sorted by relevance
215
+ data = json.loads(http_get(
216
+ "https://api.stackexchange.com/2.3/search/advanced"
217
+ "?q=asyncio+event+loop&tagged=python&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
218
+ ))
219
+ for q in data['items']:
220
+     print(q['score'], q['answer_count'], q['title'][:70])
221
+ # 137 3 "Asyncio Event Loop is Closed" when getting loop
222
+ # 47 3 Can an asyncio event loop run in the background without suspending the
223
+
224
+ # search: title-only keyword search via intitle=
225
+ data = json.loads(http_get(
226
+ "https://api.stackexchange.com/2.3/search"
227
+ "?intitle=asyncio+event+loop&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
228
+ ))
229
+ ```
230
+
231
+ `search/advanced` additional params: `accepted=True` (only questions with accepted answers), `answers=1` (minimum answer count), `body=` (keyword in body), `user=` (filter by owner user ID), `views=` (minimum view count), `fromdate=`/`todate=` (Unix timestamps).
232
+
233
+ ### User profile
234
+
235
+ ```python
236
+ import json
237
+
238
+ # Basic user info
239
+ user = json.loads(http_get("https://api.stackexchange.com/2.3/users/1?site=stackoverflow"))
240
+ u = user['items'][0]
241
+ print("User:", u['display_name'], "Rep:", u['reputation'], "Badges:", u['badge_counts'])
242
+ # User: Jeff Atwood Rep: 64159 Badges: {'bronze': 153, 'silver': 153, 'gold': 48}
243
+
244
+ # Fields: user_id, display_name, reputation, badge_counts, location, link,
245
+ # creation_date, last_access_date, is_employee, account_id,
246
+ # accept_rate, profile_image, website_url
247
+
248
+ # Timeline (badge, question, answer events)
249
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/timeline?site=stackoverflow&pagesize=5"))
250
+ print("Event types:", set(i['timeline_type'] for i in data['items']))
251
+ # {'badge'}
252
+
253
+ # User's top answers
254
+ answers = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/answers?site=stackoverflow&pagesize=5&order=desc&sort=votes"))
255
+ for a in answers['items']:
256
+     print("Score:", a['score'], "Question ID:", a.get('question_id'))
257
+
258
+ # User's questions
259
+ questions = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/questions?site=stackoverflow&pagesize=3&order=desc&sort=votes"))
260
+ for q in questions['items']:
261
+     print(q['question_id'], q['score'], q['title'][:60])
262
+ # 9 2273 How do I calculate someone&#39;s age based on a DateTime typ
263
+ # 11 1656 Calculate relative time in C#
264
+ ```
265
+
266
+ ### Comments (requires `filter=withbody`)
267
+
268
+ ```python
269
+ import json
270
+ data = json.loads(http_get(
271
+ "https://api.stackexchange.com/2.3/questions/231767/comments"
272
+ "?site=stackoverflow&pagesize=5&order=desc&sort=creation&filter=withbody"
273
+ ))
274
+ for c in data['items']:
275
+     print("Score:", c['score'], "Body:", c.get('body','')[:80])
276
+ # Comment keys (without filter): comment_id, content_license, creation_date,
277
+ # edited, owner, post_id, reply_to_user, score
278
+ # With filter=withbody: adds 'body' field (HTML-encoded)
279
+ ```
280
+
281
+ ### Related questions
282
+
283
+ ```python
284
+ import json
285
+ related = json.loads(http_get(
286
+ "https://api.stackexchange.com/2.3/questions/231767/related?site=stackoverflow&pagesize=5"
287
+ ))
288
+ for q in related['items']:
289
+     print(q['question_id'], q['score'], q['title'][:60])
290
+ # 25232350 15 how generators work in python
291
+ # 28880095 11 What does a plain yield keyword do in Python?
292
+ ```
293
+
294
+ ### Popular tags
295
+
296
+ ```python
297
+ import json
298
+ tags = json.loads(http_get("https://api.stackexchange.com/2.3/tags?order=desc&sort=popular&site=stackoverflow&pagesize=5"))
299
+ for t in tags['items']:
300
+     print(f"{t['name']}: {t['count']:,} questions")
301
+ # javascript: 2,531,995 questions
302
+ # java: 1,921,907 questions
303
+ # c#: 1,626,728 questions
304
+ # python: (check live — grows daily)
305
+ ```
306
+
307
+ ---
308
+
309
+ ## Pagination
310
+
311
+ Use `page=` (1-indexed) and `pagesize=` (max 100). Check `has_more` in the envelope to know whether a next page exists.
312
+
313
+ ```python
314
+ import json
315
+
316
+ def fetch_all_pages(url_base, max_pages=5):
317
+     """Fetch multiple pages from any Stack Exchange API endpoint."""
318
+     results = []
319
+     for page in range(1, max_pages + 1):
320
+         data = json.loads(http_get(f"{url_base}&page={page}"))
321
+         results.extend(data['items'])
322
+         if not data.get('has_more'):
323
+             break
324
+         if data.get('backoff'):
325
+             import time; time.sleep(data['backoff'])
326
+     return results
327
+
328
+ questions = fetch_all_pages(
329
+ "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes"
330
+ "&tagged=python&site=stackoverflow&pagesize=10",
331
+ max_pages=3
332
+ )
333
+ print("Total fetched:", len(questions)) # up to 30
334
+ ```
335
+
336
+ Note: `page=2` with `pagesize=3` returns the 4th–6th items. Confirmed working — `has_more: True` on page 2 of top Python questions.
337
+
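The page arithmetic in the note above can be checked with a short sketch. `page_item_range` is a hypothetical helper written for illustration; it only encodes the 1-indexed `page=` and the `pagesize=` cap of 100 described in this section.

```python
def page_item_range(page, pagesize):
    """Return the (first, last) 1-based item positions covered by a page."""
    if page < 1:
        raise ValueError("page is 1-indexed and must be >= 1")
    if not 1 <= pagesize <= 100:
        raise ValueError("pagesize must be between 1 and 100")
    first = (page - 1) * pagesize + 1
    last = page * pagesize
    return first, last


print(page_item_range(2, 3))    # (4, 6)
print(page_item_range(1, 100))  # (1, 100)
```

The last page may of course hold fewer items than `pagesize`; `has_more` in the envelope remains the authoritative signal for when to stop.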
338
+ ---
339
+
340
+ ## Parallel fetching (multiple questions or answers)
341
+
342
+ ```python
343
+ import json
344
+ from concurrent.futures import ThreadPoolExecutor
345
+
346
+ def fetch_top_answer(qid):
347
+     data = json.loads(http_get(
348
+         f"https://api.stackexchange.com/2.3/questions/{qid}/answers"
349
+         "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=1"
350
+     ))
351
+     if data['items']:
352
+         a = data['items'][0]
353
+         return {"qid": qid, "top_score": a['score'], "accepted": a.get('is_accepted')}
354
+     return {"qid": qid, "top_score": 0}
355
+
356
+ qids = [231767, 419163, 394809, 100003, 82831]
357
+ with ThreadPoolExecutor(max_workers=3) as ex:
358
+     results = list(ex.map(fetch_top_answer, qids))
359
+
360
+ for r in results:
361
+     print(r)
362
+ # {'qid': 231767, 'top_score': 18307, 'accepted': True}
363
+ # {'qid': 419163, 'top_score': 9051, 'accepted': True}
364
+ # {'qid': 394809, 'top_score': 9355, 'accepted': True}
365
+ # {'qid': 100003, 'top_score': 9334, 'accepted': False}
366
+ # {'qid': 82831, 'top_score': 6793, 'accepted': False}
367
+ ```
368
+
369
+ Keep `max_workers` at 3 or below when unauthenticated — parallel calls consume quota simultaneously. At 3 workers, 5 questions used 5 quota units (expected).
370
+
371
+ ---
372
+
373
+ ## HTML page scraping (avoid for data tasks)
374
+
375
+ The HTML page works but returns 777KB and has no clean `QAPage` JSON-LD. Use it only when you need something not in the API (e.g. rendered MathJax, ads context).
376
+
377
+ ```python
378
+ import json, re, html as htmllib
379
+ headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
380
+ page = http_get("https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python", headers=headers)
381
+ print("HTML length:", len(page)) # 777138
382
+
383
+ # Page title (includes site suffix)
384
+ title_m = re.search(r'<title>([^<]+)</title>', page)
385
+ if title_m:
386
+     print(htmllib.unescape(title_m.group(1)))
387
+ # 'iterator - What does the "yield" keyword do in Python? - Stack Overflow'
388
+
389
+ # Answer count via itemprop
390
+ ans_count = re.search(r'itemprop="answerCount"[^>]*>(\d+)<', page)
391
+ if ans_count:
392
+     print("Answers:", ans_count.group(1)) # '51'
393
+
394
+ # Score via itemprop (has whitespace around number)
395
+ score_m = re.search(r'itemprop="upvoteCount"[^>]*>\s*(-?\d+)\s*<', page)
396
+ if score_m:
397
+     print("Score:", score_m.group(1)) # '13133'
398
+
399
+ # JSON-LD is present but only has WebSite and Organization — NOT QAPage/Question
400
+ ld_match = re.search(r'<script type="application/ld\+json">(.*?)</script>', page, re.DOTALL)
401
+ if ld_match:
402
+     d = json.loads(ld_match.group(1))
403
+     types = [item.get('@type') for item in d.get('@graph', [])]
404
+     print("JSON-LD types:", types) # ['WebSite', 'Organization'] (no QAPage)
405
+ ```
406
+
407
+ ---
408
+
409
+ ## Gotchas
410
+
411
+ - **300 req/day unauthenticated is per IP, resets at midnight UTC.** 6 tests consumed ~27 quota units in one session. With parallel workers and loops, you can burn through 300 in minutes. Always check `quota_remaining` in responses.
412
+
413
+ - **`filter=withbody` is required for body content.** Without it, `body` is simply absent from the response — no error, no empty string, just a missing key. Applies to questions, answers, AND comments.
414
+
415
+ - **Title field has HTML entities, body field has full HTML markup.** They need different decoding strategies: `html.unescape()` for titles, `HTMLParser` stripping for bodies. Don't confuse them.
416
+
417
+ - **Titles in API responses contain `&quot;`, `&lt;`, `&amp;`, `&#39;`** — raw output is `What does the &quot;yield&quot; keyword do in Python?`. Always call `html.unescape()` before displaying or comparing.
418
+
419
+ - **Batch IDs with semicolons, not commas.** `/questions/231767;419163;394809` fetches 3 questions in one API call. Using commas returns a 400 error.
420
+
421
+ **`search/advanced` searches body text as well as titles; `/search` only searches titles.** Use `search/advanced` with `q=` for full-text search. Use `/search` with `intitle=` for title-only.
422
+
423
+ - **HTTP errors are raised as exceptions, not returned as JSON.** A bad `site=` param causes `urllib.error.HTTPError: HTTP Error 400: Bad Request` — there's no JSON body accessible from `http_get`. Wrap API calls in try/except.
424
+
425
+ - **`backoff` in the response envelope must be respected.** If `data.get('backoff')` returns an integer (rare, typically 10–30 seconds), sleep that duration before the next call. Ignoring it will cause throttle errors on subsequent requests.
426
+
427
+ - **`/info` endpoint wraps stats inside `items[0]`**, not directly in the envelope. Access as `data['items'][0]['total_questions']`.
428
+
429
+ - **JSON-LD on the HTML page is NOT QAPage schema.** The `<script type="application/ld+json">` block only contains `WebSite` and `Organization` objects in the `@graph` array. There is no `Question`, `Answer`, or `QAPage` type — confirmed on the most-voted Python question (231767). Don't rely on structured data from the HTML page.
430
+
431
+ - **User timeline `timeline_type` can be `badge`, `question`, `answer`, `comment`, `revision`, `suggested_edit`, `accepted`.** For very old/inactive users, all recent events may be `badge` only.
432
+
433
+ - **Multi-site support.** Change `site=stackoverflow` to any Stack Exchange site: `site=superuser`, `site=serverfault`, `site=askubuntu`, `site=unix`, `site=datascience`, `site=math`. Same API, same quota pool per IP.
434
+
435
+ - **`pagesize` max is 100.** Requesting more returns a 400 error. For bulk fetching, loop with `page=` and check `has_more`.
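Two of the gotchas above (semicolon-delimited IDs and the 100-item cap) combine naturally when fetching many known question IDs. The sketch below only builds the request URLs; `batch_question_urls` is a hypothetical helper, and setting `pagesize` to the chunk length is a precaution based on the assumption that the default page size is smaller than 100.

```python
def batch_question_urls(question_ids, site="stackoverflow", batch_size=100):
    """Build /questions/{ids} URLs with at most `batch_size` IDs per call."""
    if not 1 <= batch_size <= 100:
        raise ValueError("batch_size must be between 1 and 100")
    base = "https://api.stackexchange.com/2.3/questions/"
    ids = list(question_ids)
    urls = []
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        joined = ";".join(str(i) for i in chunk)  # semicolons, not commas
        # pagesize matches the chunk so the whole batch fits on one page
        urls.append(f"{base}{joined}?site={site}&pagesize={len(chunk)}")
    return urls


for url in batch_question_urls([231767, 419163, 394809], batch_size=2):
    print(url)
# https://api.stackexchange.com/2.3/questions/231767;419163?site=stackoverflow&pagesize=2
# https://api.stackexchange.com/2.3/questions/394809?site=stackoverflow&pagesize=1
```

Each returned URL costs one quota unit when passed to `http_get`, so 250 IDs would need 3 calls rather than 250.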