@pencil-agent/nano-pencil 2.0.0-beta.9 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (207)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/types.d.ts +5 -8
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  129. package/dist/extensions/builtin/goal/goal-prompts.js +4 -4
  130. package/dist/extensions/builtin/grub/README.md +112 -112
  131. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  132. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  133. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  134. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  135. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  136. package/dist/extensions/builtin/loop/README.md +92 -92
  137. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  138. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  139. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  140. package/dist/extensions/builtin/sal/README.md +72 -72
  141. package/dist/extensions/builtin/security-audit/README.md +289 -289
  142. package/dist/extensions/builtin/team/AGENT.md +112 -112
  143. package/dist/extensions/builtin/team/TESTING.md +299 -299
  144. package/dist/extensions/builtin/token-save/README.md +56 -56
  145. package/dist/extensions/optional/AGENT.md +10 -10
  146. package/dist/index.d.ts +5 -30
  147. package/dist/index.js +1 -1
  148. package/dist/models.d.ts +7 -0
  149. package/dist/models.js +1 -0
  150. package/dist/modes/interactive/theme/dark.json +85 -85
  151. package/dist/modes/interactive/theme/light.json +84 -84
  152. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  153. package/dist/modes/interactive/theme/warm.json +81 -81
  154. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  155. package/dist/packages/protocol/src/flags.d.ts +20 -0
  156. package/dist/packages/protocol/src/flags.js +0 -0
  157. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  158. package/dist/packages/protocol/src/hooks.js +0 -0
  159. package/dist/packages/protocol/src/index.d.ts +4 -2
  160. package/dist/packages/protocol/src/index.js +1 -1
  161. package/dist/packages/protocol/src/lifecycle.d.ts +11 -21
  162. package/dist/public-config.d.ts +12 -0
  163. package/dist/public-config.js +1 -0
  164. package/dist/runtime.d.ts +9 -0
  165. package/dist/runtime.js +1 -0
  166. package/dist/session-compaction.d.ts +7 -0
  167. package/dist/session-compaction.js +1 -0
  168. package/dist/session.d.ts +7 -0
  169. package/dist/session.js +1 -0
  170. package/dist/skills.d.ts +7 -0
  171. package/dist/skills.js +1 -0
  172. package/dist/tools.d.ts +7 -0
  173. package/dist/tools.js +1 -0
  174. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  175. package/docs/SDK-TESTING.md +364 -0
  176. package/docs/codex-goal-command-impl.md +1055 -1055
  177. package/docs/codex-goal-vs-grub.md +500 -500
  178. package/docs/custom-provider.md +27 -27
  179. package/docs/extensions.md +27 -27
  180. package/docs/keybindings.md +27 -27
  181. package/docs/loop /351/207/215/346/236/204/345/256/214/346/210/220/346/200/273/347/273/223.md +250 -250
  182. package/docs/loop /351/207/215/346/236/204/345/256/214/346/210/220/346/212/245/345/221/212.md +122 -122
  183. package/docs/loop /351/207/215/346/236/204/346/226/271/346/241/210.md +1222 -1222
  184. package/docs/loop /351/207/215/346/236/204/346/226/271/346/241/210/345/256/236/347/216/260/346/212/245/345/221/212.md +158 -158
  185. package/docs/loop /351/207/215/346/236/204/346/226/271/346/241/210/345/257/271/346/257/224/345/210/206/346/236/220.md +128 -128
  186. package/docs/loop /351/207/215/346/236/204/350/256/241/345/210/222.md +320 -320
  187. package/docs/loop-usage-examples.md +214 -214
  188. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  189. package/docs/models.md +27 -27
  190. package/docs/packages.md +27 -27
  191. package/docs/pi-design-philosophy.md +457 -457
  192. package/docs/planmode.md +1987 -1987
  193. package/docs/prompt-templates.md +27 -27
  194. package/docs/providers.md +27 -27
  195. package/docs/sdk.md +27 -27
  196. package/docs/skills.md +27 -27
  197. package/docs/startup-performance-optimization.md +301 -0
  198. package/docs/themes.md +27 -27
  199. package/docs/tui.md +27 -27
  200. package/docs//350/256/244/347/237/245/345/234/260/345/233/276.md +47 -0
  201. package/package.json +190 -162
  202. package/docs/cc-agent-design.md +0 -1297
  203. package/docs/cc-tui-design.md +0 -1333
  204. package/docs/nanoPencil-/345/255/246/344/271/240/350/256/241/345/210/222.md +0 -170
  205. package/docs/scan-report.md +0 -3820
  206. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  207. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,435 +1,435 @@
1
- # Stack Overflow — Scraping & Data Extraction
2
-
3
- `https://stackoverflow.com` — all public read-only data is available via the Stack Exchange API v2.3. No auth, no browser required for any read operation. API is fast, returns gzip-compressed JSON, and works transparently with `http_get`.
4
-
5
- ## Do this first: pick your access path
6
-
7
- | Goal | Best approach | Notes |
8
- |------|--------------|-------|
9
- | Top/hot questions by tag | `GET /2.3/questions` | Add `filter=withbody` for question text |
10
- | Answers for a question | `GET /2.3/questions/{id}/answers` | Add `filter=withbody` for answer text |
11
- | Search by keyword + tag | `GET /2.3/search/advanced` | More filters than `/search` |
12
- | Simple title keyword search | `GET /2.3/search` | `intitle=` param |
13
- | Fetch by known question IDs | `GET /2.3/questions/{id1};{id2};...` | Semicolon-delimited batch, up to 100 |
14
- | User profile + reputation | `GET /2.3/users/{id}` | Public fields only |
15
- | User activity timeline | `GET /2.3/users/{id}/timeline` | Events: badges, answers, questions |
16
- | User's questions / answers | `GET /2.3/users/{id}/questions` or `/answers` | Standard listing |
17
- | Comments on a post | `GET /2.3/questions/{id}/comments` | Needs `filter=withbody` for body |
18
- | Related questions | `GET /2.3/questions/{id}/related` | Returns linked/similar questions |
19
- | Answer by ID directly | `GET /2.3/answers/{id}` | One or more semicolon-separated IDs |
20
- | Popular tags | `GET /2.3/tags` | Sort by `popular`, `activity`, or `name` |
21
- | Site-wide statistics | `GET /2.3/info` | Total questions, quota, etc. |
22
- | Question HTML page | `http_get` with User-Agent | Returns 777KB HTML; prefer API |
23
-
24
- **Use the API for all data tasks.** The HTML page is 777KB, lacks clean structure, and the JSON-LD block only contains `WebSite` and `Organization` objects (no `QAPage` or `Question` schema). The API returns the same data in milliseconds, fully structured.
25
-
26
- ---
27
-
28
- ## Quota limits
29
-
30
- The API is unauthenticated-friendly but strictly quota-capped per IP per day:
31
-
32
- | Auth level | Daily quota | Burst |
33
- |------------|-------------|-------|
34
- | No key (unauthenticated) | **300 requests/day** | No enforced burst limit observed |
35
- | With API key | **10,000 requests/day** | Same |
36
-
37
- Check your remaining quota in every response envelope:
38
-
39
- ```python
40
- import json
41
- data = json.loads(http_get("https://api.stackexchange.com/2.3/info?site=stackoverflow"))
42
- print("Quota remaining:", data.get('quota_remaining')) # e.g. 273
43
- print("Quota max:", data.get('quota_max')) # 300 unauthenticated, 10000 with key
44
- # Confirmed: quota_max=300, quota_remaining decrements per call
45
- ```
46
-
47
- Every API response includes `quota_remaining` in the envelope. Monitor it. When it hits 0, all calls return HTTP 400 with `error_id: 502` (throttle_violation). There is no retry-after header — wait until midnight UTC.
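The reset rule above can be turned into a concrete wait time. This is a minimal sketch, assuming the quota resets at 00:00 UTC as the text states; `seconds_until_utc_midnight` and `should_pause` are hypothetical helper names, not part of the harness.

```python
from datetime import datetime, timedelta, timezone

def seconds_until_utc_midnight(now=None):
    """Seconds from `now` (default: the current time) until the next 00:00 UTC."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()

def should_pause(envelope, reserve=0):
    """True when the envelope reports no quota left beyond `reserve` calls."""
    remaining = envelope.get("quota_remaining")
    return remaining is not None and remaining <= reserve
```

A caller could check `should_pause(data)` after each response and sleep for `seconds_until_utc_midnight()` instead of issuing calls that would fail.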
48
-
49
- **If you have an API key**, append `&key=YOUR_KEY` to any URL to use the 10,000/day quota.
50
-
51
- ---
52
-
53
- ## Response envelope
54
-
55
- Every response from the Stack Exchange API is wrapped in a consistent envelope:
56
-
57
- ```python
58
- {
59
-     "items": [...],          # list of result objects
60
-     "has_more": True/False,  # whether more pages exist
61
-     "quota_max": 300,        # total daily quota
62
-     "quota_remaining": 273,  # calls left today
63
-     "backoff": None          # seconds to wait before next call (rare)
64
- }
65
- ```
66
-
67
- Always check `data.get('backoff')` — if it returns an integer, sleep that many seconds before the next call. Ignoring it causes throttle errors.
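The backoff rule above could be wrapped in one small helper. This is a sketch only: `honor_backoff` is a hypothetical name, and the injectable `sleep` argument exists purely so the function can be exercised without waiting.

```python
import time

def honor_backoff(envelope, sleep=time.sleep):
    """Sleep for the `backoff` seconds in a parsed envelope, if any.

    Returns the number of seconds waited (0 when no backoff was requested).
    """
    delay = envelope.get("backoff") or 0   # field may be absent or None
    if delay > 0:
        sleep(delay)
    return delay
```

Calling it on every parsed response, before the next request, keeps the check in one place.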
68
-
69
- Error responses raise `urllib.error.HTTPError` (not a JSON envelope):
70
- - HTTP 400 — invalid parameter (e.g. bad site name) — raises exception
71
- - HTTP 400 with JSON body — quota exhausted or throttle_violation
72
-
73
- ```python
74
- try:
75
-     data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?site=stackoverflow&pagesize=1"))
76
- except Exception as e:
77
-     print("API error:", e) # HTTPError HTTP Error 400: Bad Request
78
- ```
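When the HTTP 400 carries a JSON body, decoding it distinguishes a throttle from a bad parameter. The sketch below assumes the raw error bytes are obtainable (for `urllib.error.HTTPError`, via `e.read()`) and that they may be gzip-compressed like normal responses; whether `http_get` exposes the original exception is an assumption. `parse_error_body` and `is_throttled` are hypothetical names.

```python
import gzip
import json

def parse_error_body(raw: bytes) -> dict:
    """Decode an error response body into a dict; return {} if it is not a JSON object."""
    if raw[:2] == b"\x1f\x8b":            # gzip magic number
        raw = gzip.decompress(raw)
    try:
        parsed = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}

def is_throttled(err: dict) -> bool:
    """True for the throttle_violation case described above (error_id 502)."""
    return err.get("error_id") == 502
```

Inside the `except` branch, something like `is_throttled(parse_error_body(e.read()))` would separate quota exhaustion from a malformed request.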
79
-
80
- ---
81
-
82
- ## `filter=withbody` — required for post content
83
-
84
- By default, the API strips the `body` field from all responses. You **must** add `filter=withbody` to get question or answer text. This applies to questions, answers, and comments alike.
85
-
86
- ```python
87
- import json
88
-
89
- # WITHOUT filter=withbody — body field is ABSENT
90
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1"))
91
- q = data['items'][0]
92
- print("Has body:", 'body' in q) # False
93
- print("Keys:", sorted(q.keys()))
94
- # ['accepted_answer_id', 'answer_count', 'content_license', 'creation_date',
95
- # 'is_answered', 'last_activity_date', 'last_edit_date', 'link', 'owner',
96
- # 'protected_date', 'question_id', 'score', 'tags', 'title', 'view_count']
97
-
98
- # WITH filter=withbody — body field is PRESENT
99
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1&filter=withbody"))
100
- q = data['items'][0]
101
- print("Has body:", 'body' in q) # True
102
- print("Body preview:", q['body'][:60])
103
- # '<p>What functionality does the <a href="https://do...'
104
- ```
105
-
106
- ---
107
-
108
- ## HTML encoding in API responses
109
-
110
- The API returns HTML in two contexts, and plain text in a third:
111
-
112
- - **`body` field** (questions, answers, comments) — full HTML markup. Headings, code blocks, links, blockquotes, lists. Strip with `html.parser` for plain text.
113
- - **`title` field** — HTML-entity-encoded plain text. Quotes, angle brackets, and ampersands are escaped (`&quot;`, `&lt;`, `&amp;`). Decode with `html.unescape()`.
114
- - **`display_name`, `link`, `tags`** — plain text, no encoding.
115
-
116
- ```python
117
- import json, html
118
- from html.parser import HTMLParser
119
-
120
- data = json.loads(http_get("https://api.stackexchange.com/2.3/questions/231767?site=stackoverflow&filter=withbody"))
121
- q = data['items'][0]
122
-
123
- # Title has HTML entities
124
- print("Raw title:", q['title'])
125
- # 'What does the &quot;yield&quot; keyword do in Python?'
126
- print("Decoded:", html.unescape(q['title']))
127
- # 'What does the "yield" keyword do in Python?'
128
-
129
- # Body is full HTML — strip for plain text
130
- class Stripper(HTMLParser):
131
-     def __init__(self):
132
-         super().__init__()
133
-         self.text = []
134
-     def handle_data(self, d):
135
-         self.text.append(d)
136
-     def get_text(self):
137
-         return ''.join(self.text)
138
-
139
- s = Stripper()
140
- s.feed(q['body'])
141
- print(s.get_text()[:200])
142
- # 'What functionality does the yield keyword do in Python?\nWhat is the ...'
143
- ```
144
-
145
- ---
146
-
147
- ## Common workflows
148
-
149
- ### Top questions by tag (API)
150
-
151
- ```python
152
- import json, html
153
- data = json.loads(http_get(
154
-     "https://api.stackexchange.com/2.3/questions"
155
-     "?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=5&filter=withbody"
156
- ))
157
- for q in data['items']:
158
-     print(q['question_id'], q['score'], html.unescape(q['title'])[:60])
159
-     print(" Tags:", q['tags'][:3], "Answers:", q['answer_count'])
160
- print("Quota remaining:", data.get('quota_remaining'))
161
- # 231767 13133 What does the "yield" keyword do in Python?
162
- # Tags: ['python', 'iterator', 'generator'] Answers: 51
163
- # 419163 8438 What does if __name__ == "__main__": do?
164
- # Tags: ['python', 'namespaces', 'program-entry-point'] Answers: 40
165
- # Quota remaining: 299
166
- ```
167
-
168
- Sort options for `/questions`: `activity`, `votes`, `creation`, `hot`, `week`, `month`.
169
-
170
- ### Answers for a question
171
-
172
- ```python
173
- import json
174
- data = json.loads(http_get(
175
-     "https://api.stackexchange.com/2.3/questions/231767/answers"
176
-     "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=3"
177
- ))
178
- for a in data['items']:
179
-     print(f"Score: {a['score']}, Accepted: {a.get('is_accepted')}")
180
-     print(f" Body preview: {a['body'][:150]}")
181
- # Score: 18307, Accepted: True
182
- # Body preview: <p>To understand what <a href="...">yield</a> does, ...
183
- # Score: 2596, Accepted: False
184
- # Score: 802, Accepted: False
185
- ```
186
-
187
- Answer fields (with `filter=withbody`): `answer_id`, `question_id`, `score`, `is_accepted`, `body`, `owner`, `creation_date`, `last_activity_date`, `content_license`.
188
-
189
- ### Fetch questions by ID (batch)
190
-
191
- Fetch up to 100 questions in one call using semicolons:
192
-
193
- ```python
194
- import json
195
- data = json.loads(http_get(
196
-     "https://api.stackexchange.com/2.3/questions/231767;419163;394809"
197
-     "?site=stackoverflow&filter=withbody"
198
- ))
199
- print("Fetched:", len(data['items'])) # 3
200
- for q in data['items']:
201
-     print(q['question_id'], q['score'], q['title'][:50])
202
- # 231767 13133 What does the &quot;yield&quot; keyword do in Pyth
203
- # 419163 8438 What does if __name__ == &quot;__main__&quot;: do?
204
- # 394809 8125 Does Python have a ternary conditional operator?
205
- ```
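For ID lists longer than the 100-per-call limit mentioned above, the IDs need to be split into batches first. This is a minimal sketch; `batch_id_paths` is a hypothetical helper, and dropping duplicate IDs is a design choice of the sketch rather than an API requirement.

```python
def batch_id_paths(ids, batch_size=100):
    """Yield semicolon-joined ID strings, each holding at most `batch_size` IDs."""
    unique = list(dict.fromkeys(ids))          # drop duplicates, keep first-seen order
    for start in range(0, len(unique), batch_size):
        chunk = unique[start:start + batch_size]
        yield ";".join(str(i) for i in chunk)
```

Each yielded string slots into the path, for example `/2.3/questions/{ids}`. It is probably safest to pass `pagesize` equal to the batch size so a single page holds the whole batch; the API's default page size is not stated in this document and should be checked.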
206
-
207
- ### Search — `search/advanced` vs `search`
208
-
209
- Use `/search/advanced` when you need combined keyword + tag filtering. Use `/search` when searching only by title keyword (`intitle=`).
210
-
211
- ```python
212
- import json
213
-
214
- # search/advanced: keyword in body OR title, filtered by tag, sorted by relevance
215
- data = json.loads(http_get(
216
-     "https://api.stackexchange.com/2.3/search/advanced"
217
-     "?q=asyncio+event+loop&tagged=python&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
218
- ))
219
- for q in data['items']:
220
-     print(q['score'], q['answer_count'], q['title'][:70])
221
- # 137 3 "Asyncio Event Loop is Closed" when getting loop
222
- # 47 3 Can an asyncio event loop run in the background without suspending the
223
-
224
- # search: title-only keyword search via intitle=
225
- data = json.loads(http_get(
226
-     "https://api.stackexchange.com/2.3/search"
227
-     "?intitle=asyncio+event+loop&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
228
- ))
229
- ```
230
-
231
- `search/advanced` additional params: `accepted=True` (only questions with accepted answers), `answers=1` (minimum answer count), `body=` (keyword in body), `user=` (filter by owner user ID), `views=` (minimum view count), `fromdate=`/`todate=` (Unix timestamps).
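Hand-assembling query strings with this many optional parameters is error-prone, so a small builder may help. This is a sketch under assumptions: `advanced_search_url` and `API_ROOT` are hypothetical names, the defaults mirror the example above, and the optional `key` follows the `&key=YOUR_KEY` convention from the quota section.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"

def advanced_search_url(q, site="stackoverflow", key=None, **params):
    """Build a /search/advanced URL; parameters whose value is None are skipped."""
    query = {"q": q, "site": site, "order": "desc", "sort": "relevance"}
    query.update({k: v for k, v in params.items() if v is not None})
    if key:
        query["key"] = key                 # switches to the keyed daily quota
    return f"{API_ROOT}/search/advanced?{urlencode(query)}"
```

For example, `advanced_search_url("asyncio event loop", tagged="python", accepted="True", answers=1)` produces a URL that can be passed straight to `http_get`; `urlencode` handles the escaping of spaces and special characters.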
232
-
233
- ### User profile
234
-
235
- ```python
236
- import json
237
-
238
- # Basic user info
239
- user = json.loads(http_get("https://api.stackexchange.com/2.3/users/1?site=stackoverflow"))
240
- u = user['items'][0]
241
- print("User:", u['display_name'], "Rep:", u['reputation'], "Badges:", u['badge_counts'])
242
- # User: Jeff Atwood Rep: 64159 Badges: {'bronze': 153, 'silver': 153, 'gold': 48}
243
-
244
- # Fields: user_id, display_name, reputation, badge_counts, location, link,
245
- # creation_date, last_access_date, is_employee, account_id,
246
- # accept_rate, profile_image, website_url
247
-
248
- # Timeline (badge, question, answer events)
249
- data = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/timeline?site=stackoverflow&pagesize=5"))
250
- print("Event types:", set(i['timeline_type'] for i in data['items']))
251
- # {'badge'}
252
-
253
- # User's top answers
254
- answers = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/answers?site=stackoverflow&pagesize=5&order=desc&sort=votes"))
255
- for a in answers['items']:
256
-     print("Score:", a['score'], "Question ID:", a.get('question_id'))
257
-
258
- # User's questions
259
- questions = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/questions?site=stackoverflow&pagesize=3&order=desc&sort=votes"))
260
- for q in questions['items']:
261
-     print(q['question_id'], q['score'], q['title'][:60])
262
- # 9 2273 How do I calculate someone&#39;s age based on a DateTime typ
263
- # 11 1656 Calculate relative time in C#
264
- ```
265
-
266
- ### Comments (requires `filter=withbody`)
267
-
268
- ```python
269
- import json
270
- data = json.loads(http_get(
271
-     "https://api.stackexchange.com/2.3/questions/231767/comments"
272
-     "?site=stackoverflow&pagesize=5&order=desc&sort=creation&filter=withbody"
273
- ))
274
- for c in data['items']:
275
-     print("Score:", c['score'], "Body:", c.get('body','')[:80])
276
- # Comment keys (without filter): comment_id, content_license, creation_date,
277
- # edited, owner, post_id, reply_to_user, score
278
- # With filter=withbody: adds 'body' field (HTML-encoded)
279
- ```
280
-
281
- ### Related questions
282
-
283
- ```python
284
- import json
285
- related = json.loads(http_get(
286
-     "https://api.stackexchange.com/2.3/questions/231767/related?site=stackoverflow&pagesize=5"
287
- ))
288
- for q in related['items']:
289
-     print(q['question_id'], q['score'], q['title'][:60])
290
- # 25232350 15 how generators work in python
291
- # 28880095 11 What does a plain yield keyword do in Python?
292
- ```
293
-
294
- ### Popular tags
295
-
296
- ```python
297
- import json
298
- tags = json.loads(http_get("https://api.stackexchange.com/2.3/tags?order=desc&sort=popular&site=stackoverflow&pagesize=5"))
299
- for t in tags['items']:
300
-     print(f"{t['name']}: {t['count']:,} questions")
301
- # javascript: 2,531,995 questions
302
- # java: 1,921,907 questions
303
- # c#: 1,626,728 questions
304
- # python: (check live — grows daily)
305
- ```
306
-
307
- ---
308
-
309
- ## Pagination
310
-
311
- Use `page=` (1-indexed) and `pagesize=` (max 100). Check `has_more` in the envelope to know whether a next page exists.
312
-
313
- ```python
314
- import json
315
-
316
- def fetch_all_pages(url_base, max_pages=5):
317
-     """Fetch multiple pages from any Stack Exchange API endpoint."""
318
-     results = []
319
-     for page in range(1, max_pages + 1):
320
-         data = json.loads(http_get(f"{url_base}&page={page}"))
321
-         results.extend(data['items'])
322
-         if not data.get('has_more'):
323
-             break
324
-         if data.get('backoff'):
325
-             import time; time.sleep(data['backoff'])
326
-     return results
327
-
328
- questions = fetch_all_pages(
329
-     "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes"
330
-     "&tagged=python&site=stackoverflow&pagesize=10",
331
-     max_pages=3
332
- )
333
- print("Total fetched:", len(questions)) # up to 30
334
- ```
335
-
336
- Note: `page=2` with `pagesize=3` returns the 4th–6th items. Confirmed working — `has_more: True` on page 2 of top Python questions.
337
-
338
- ---
339
-
340
- ## Parallel fetching (multiple questions or answers)
341
-
342
- ```python
343
- import json
344
- from concurrent.futures import ThreadPoolExecutor
345
-
346
- def fetch_top_answer(qid):
347
- data = json.loads(http_get(
348
- f"https://api.stackexchange.com/2.3/questions/{qid}/answers"
349
- "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=1"
350
- ))
351
- if data['items']:
352
- a = data['items'][0]
353
- return {"qid": qid, "top_score": a['score'], "accepted": a.get('is_accepted')}
354
- return {"qid": qid, "top_score": 0}
355
-
356
- qids = [231767, 419163, 394809, 100003, 82831]
357
- with ThreadPoolExecutor(max_workers=3) as ex:
358
- results = list(ex.map(fetch_top_answer, qids))
359
-
360
- for r in results:
361
- print(r)
362
- # {'qid': 231767, 'top_score': 18307, 'accepted': True}
363
- # {'qid': 419163, 'top_score': 9051, 'accepted': True}
364
- # {'qid': 394809, 'top_score': 9355, 'accepted': True}
365
- # {'qid': 100003, 'top_score': 9334, 'accepted': False}
366
- # {'qid': 82831, 'top_score': 6793, 'accepted': False}
367
- ```
368
-
369
- Keep `max_workers` at 3 or below when unauthenticated — parallel calls consume quota simultaneously. At 3 workers, 5 questions used 5 quota units (expected).
370
-
371
- ---
372
-
373
- ## HTML page scraping (avoid for data tasks)
374
-
375
- The HTML page works but returns 777KB and has no clean `QAPage` JSON-LD. Use it only when you need something not in the API (e.g. rendered MathJax, ads context).
376
-
377
- ```python
378
- import re, html as htmllib
379
- headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
380
- page = http_get("https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python", headers=headers)
381
- print("HTML length:", len(page)) # 777138
382
-
383
- # Page title (includes site suffix)
384
- title_m = re.search(r'<title>([^<]+)</title>', page)
385
- if title_m:
386
- print(htmllib.unescape(title_m.group(1)))
387
- # 'iterator - What does the "yield" keyword do in Python? - Stack Overflow'
388
-
389
- # Answer count via itemprop
390
- ans_count = re.search(r'itemprop="answerCount"[^>]*>(\d+)<', page)
391
- if ans_count:
392
- print("Answers:", ans_count.group(1)) # '51'
393
-
394
- # Score via itemprop (has whitespace around number)
395
- score_m = re.search(r'itemprop="upvoteCount"[^>]*>\s*(-?\d+)\s*<', page)
396
- if score_m:
397
- print("Score:", score_m.group(1)) # '13133'
398
-
399
- # JSON-LD is present but only has WebSite and Organization — NOT QAPage/Question
400
- ld_match = re.search(r'<script type="application/ld\+json">(.*?)</script>', page, re.DOTALL)
401
- if ld_match:
402
- d = json.loads(ld_match.group(1))
403
- types = [item.get('@type') for item in d.get('@graph', [])]
404
- print("JSON-LD types:", types) # ['WebSite', 'Organization'] — no QAPage
405
- ```
406
-
407
- ---
408
-
409
- ## Gotchas
410
-
411
- - **300 req/day unauthenticated is per IP, resets at midnight UTC.** 6 tests consumed ~27 quota units in one session. With parallel workers and loops, you can burn through 300 in minutes. Always check `quota_remaining` in responses.
412
-
413
- - **`filter=withbody` is required for body content.** Without it, `body` is simply absent from the response — no error, no empty string, just a missing key. Applies to questions, answers, AND comments.
414
-
415
- - **Title field has HTML entities, body field has full HTML markup.** They need different decoding strategies: `html.unescape()` for titles, `HTMLParser` stripping for bodies. Don't confuse them.
416
-
417
- - **Titles in API responses contain `&quot;`, `&lt;`, `&amp;`, `&#39;`** — raw output is `What does the &quot;yield&quot; keyword do in Python?`. Always call `html.unescape()` before displaying or comparing.
418
-
419
- - **Batch IDs with semicolons, not commas.** `/questions/231767;419163;394809` fetches 3 questions in one API call. Using commas returns a 400 error.
420
-
421
- - **`search/advanced` includes body text in results; `/search` only searches titles.** Use `search/advanced` with `q=` for full-text search. Use `/search` with `intitle=` for title-only.
422
-
423
- - **HTTP errors are raised as exceptions, not returned as JSON.** A bad `site=` param causes `urllib.error.HTTPError: HTTP Error 400: Bad Request` — there's no JSON body accessible from `http_get`. Wrap API calls in try/except.
424
-
425
- - **`backoff` in the response envelope must be respected.** If `data.get('backoff')` returns an integer (rare, typically 10–30 seconds), sleep that duration before the next call. Ignoring it will cause throttle errors on subsequent requests.
426
-
427
- - **`/info` endpoint wraps stats inside `items[0]`**, not directly in the envelope. Access as `data['items'][0]['total_questions']`.
428
-
429
- - **JSON-LD on the HTML page is NOT QAPage schema.** The `<script type="application/ld+json">` block only contains `WebSite` and `Organization` objects in the `@graph` array. There is no `Question`, `Answer`, or `QAPage` type — confirmed on the most-voted Python question (231767). Don't rely on structured data from the HTML page.
430
-
431
- - **User timeline `timeline_type` can be `badge`, `question`, `answer`, `comment`, `revision`, `suggested_edit`, `accepted`.** For very old/inactive users, all recent events may be `badge` only.
432
-
433
- - **Multi-site support.** Change `site=stackoverflow` to any Stack Exchange site: `site=superuser`, `site=serverfault`, `site=askubuntu`, `site=unix`, `site=datascience`, `site=math`. Same API, same quota pool per IP.
434
-
435
- - **`pagesize` max is 100.** Requesting more returns a 400 error. For bulk fetching, loop with `page=` and check `has_more`.
1
+ # Stack Overflow — Scraping & Data Extraction
2
+
3
+ `https://stackoverflow.com` — all public read-only data is available via the Stack Exchange API v2.3. No auth, no browser required for any read operation. API is fast, returns gzip-compressed JSON, and works transparently with `http_get`.
4
+
5
+ ## Do this first: pick your access path
6
+
7
+ | Goal | Best approach | Notes |
8
+ |------|--------------|-------|
9
+ | Top/hot questions by tag | `GET /2.3/questions` | Add `filter=withbody` for question text |
10
+ | Answers for a question | `GET /2.3/questions/{id}/answers` | Add `filter=withbody` for answer text |
11
+ | Search by keyword + tag | `GET /2.3/search/advanced` | More filters than `/search` |
12
+ | Simple title keyword search | `GET /2.3/search` | `intitle=` param |
13
+ | Fetch by known question IDs | `GET /2.3/questions/{id1};{id2};...` | Semicolon-delimited batch, up to 100 |
14
+ | User profile + reputation | `GET /2.3/users/{id}` | Public fields only |
15
+ | User activity timeline | `GET /2.3/users/{id}/timeline` | Events: badges, answers, questions |
16
+ | User's questions / answers | `GET /2.3/users/{id}/questions` or `/answers` | Standard listing |
17
+ | Comments on a post | `GET /2.3/questions/{id}/comments` | Needs `filter=withbody` for body |
18
+ | Related questions | `GET /2.3/questions/{id}/related` | Returns linked/similar questions |
19
+ | Answer by ID directly | `GET /2.3/answers/{id}` | One or more semicolon-separated IDs |
20
+ | Popular tags | `GET /2.3/tags` | Sort by `popular`, `activity`, or `name` |
21
+ | Site-wide statistics | `GET /2.3/info` | Total questions, quota, etc. |
22
+ | Question HTML page | `http_get` with User-Agent | Returns 777KB HTML; prefer API |
23
+
24
+ **Use the API for all data tasks.** The HTML page is 777KB, lacks clean structure, and the JSON-LD block only contains `WebSite` and `Organization` objects (no `QAPage` or `Question` schema). The API returns the same data in milliseconds, fully structured.
25
+
26
+ ---
27
+
28
+ ## Quota limits
29
+
30
+ The API is unauthenticated-friendly but strictly quota-capped per IP per day:
31
+
32
+ | Auth level | Daily quota | Burst |
33
+ |------------|-------------|-------|
34
+ | No key (unauthenticated) | **300 requests/day** | No enforced burst limit observed |
35
+ | With API key | **10,000 requests/day** | Same |
36
+
37
+ Check your remaining quota in every response envelope:
38
+
39
+ ```python
40
+ import json
41
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/info?site=stackoverflow"))
42
+ print("Quota remaining:", data.get('quota_remaining')) # e.g. 273
43
+ print("Quota max:", data.get('quota_max')) # 300 unauthenticated, 10000 with key
44
+ # Confirmed: quota_max=300, quota_remaining decrements per call
45
+ ```
46
+
47
+ Every API response includes `quota_remaining` in the envelope. Monitor it. When it hits 0, all calls return HTTP 400 with `error_id: 502` (throttle_violation). There is no retry-after header — wait until midnight UTC.
48
+
49
+ **If you have an API key**, append `&key=YOUR_KEY` to any URL to use the 10,000/day quota.
50
+
51
+ ---
52
+
53
+ ## Response envelope
54
+
55
+ Every response from the Stack Exchange API is wrapped in a consistent envelope:
56
+
57
+ ```python
58
+ {
59
+ "items": [...], # list of result objects
60
+ "has_more": True/False, # whether more pages exist
61
+ "quota_max": 300, # total daily quota
62
+ "quota_remaining": 273, # calls left today
63
+ "backoff": None # seconds to wait before next call (rare)
64
+ }
65
+ ```
66
+
67
+ Always check `data.get('backoff')` — if it returns an integer, sleep that many seconds before the next call. Ignoring it causes throttle errors.
68
+
69
+ Error responses raise `urllib.error.HTTPError` (not a JSON envelope):
70
+ - HTTP 400 — invalid parameter (e.g. bad site name) — raises exception
71
+ - HTTP 400 with JSON body — quota exhausted or throttle_violation
72
+
73
+ ```python
74
+ try:
75
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?site=stackoverflow&pagesize=1"))
76
+ except Exception as e:
77
+ print("API error:", e) # HTTPError HTTP Error 400: Bad Request
78
+ ```
79
+
80
+ ---
81
+
82
+ ## `filter=withbody` — required for post content
83
+
84
+ By default, the API strips the `body` field from all responses. You **must** add `filter=withbody` to get question or answer text. This applies to questions, answers, and comments alike.
85
+
86
+ ```python
87
+ import json
88
+
89
+ # WITHOUT filter=withbody — body field is ABSENT
90
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1"))
91
+ q = data['items'][0]
92
+ print("Has body:", 'body' in q) # False
93
+ print("Keys:", sorted(q.keys()))
94
+ # ['accepted_answer_id', 'answer_count', 'content_license', 'creation_date',
95
+ # 'is_answered', 'last_activity_date', 'last_edit_date', 'link', 'owner',
96
+ # 'protected_date', 'question_id', 'score', 'tags', 'title', 'view_count']
97
+
98
+ # WITH filter=withbody — body field is PRESENT
99
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=1&filter=withbody"))
100
+ q = data['items'][0]
101
+ print("Has body:", 'body' in q) # True
102
+ print("Body preview:", q['body'][:60])
103
+ # '<p>What functionality does the <a href="https://do...'
104
+ ```
105
+
106
+ ---
107
+
108
+ ## HTML encoding in API responses
109
+
110
+ The API returns HTML in two contexts, and plain text in a third:
111
+
112
+ - **`body` field** (questions, answers, comments) — full HTML markup. Headings, code blocks, links, blockquotes, lists. Strip with `html.parser` for plain text.
113
+ - **`title` field** — HTML-entity-encoded plain text. Quotes, angle brackets, and ampersands are escaped (`&quot;`, `&lt;`, `&amp;`). Decode with `html.unescape()`.
114
+ - **`display_name`, `link`, `tags`** — plain text, no encoding.
115
+
116
+ ```python
117
+ import json, html
118
+ from html.parser import HTMLParser
119
+
120
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/questions/231767?site=stackoverflow&filter=withbody"))
121
+ q = data['items'][0]
122
+
123
+ # Title has HTML entities
124
+ print("Raw title:", q['title'])
125
+ # 'What does the &quot;yield&quot; keyword do in Python?'
126
+ print("Decoded:", html.unescape(q['title']))
127
+ # 'What does the "yield" keyword do in Python?'
128
+
129
+ # Body is full HTML — strip for plain text
130
+ class Stripper(HTMLParser):
131
+ def __init__(self):
132
+ super().__init__()
133
+ self.text = []
134
+ def handle_data(self, d):
135
+ self.text.append(d)
136
+ def get_text(self):
137
+ return ''.join(self.text)
138
+
139
+ s = Stripper()
140
+ s.feed(q['body'])
141
+ print(s.get_text()[:200])
142
+ # 'What functionality does the yield keyword do in Python?\nWhat is the ...'
143
+ ```
144
+
145
+ ---
146
+
147
+ ## Common workflows
148
+
149
+ ### Top questions by tag (API)
150
+
151
+ ```python
152
+ import json, html
153
+ data = json.loads(http_get(
154
+ "https://api.stackexchange.com/2.3/questions"
155
+ "?order=desc&sort=votes&tagged=python&site=stackoverflow&pagesize=5&filter=withbody"
156
+ ))
157
+ for q in data['items']:
158
+ print(q['question_id'], q['score'], html.unescape(q['title'])[:60])
159
+ print(" Tags:", q['tags'][:3], "Answers:", q['answer_count'])
160
+ print("Quota remaining:", data.get('quota_remaining'))
161
+ # 231767 13133 What does the "yield" keyword do in Python?
162
+ # Tags: ['python', 'iterator', 'generator'] Answers: 51
163
+ # 419163 8438 What does if __name__ == "__main__": do?
164
+ # Tags: ['python', 'namespaces', 'program-entry-point'] Answers: 40
165
+ # Quota remaining: 299
166
+ ```
167
+
168
+ Sort options for `/questions`: `activity`, `votes`, `creation`, `hot`, `week`, `month`.
169
+
170
+ ### Answers for a question
171
+
172
+ ```python
173
+ import json
174
+ data = json.loads(http_get(
175
+ "https://api.stackexchange.com/2.3/questions/231767/answers"
176
+ "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=3"
177
+ ))
178
+ for a in data['items']:
179
+ print(f"Score: {a['score']}, Accepted: {a.get('is_accepted')}")
180
+ print(f" Body preview: {a['body'][:150]}")
181
+ # Score: 18307, Accepted: True
182
+ # Body preview: <p>To understand what <a href="...">yield</a> does, ...
183
+ # Score: 2596, Accepted: False
184
+ # Score: 802, Accepted: False
185
+ ```
186
+
187
+ Answer fields (with `filter=withbody`): `answer_id`, `question_id`, `score`, `is_accepted`, `body`, `owner`, `creation_date`, `last_activity_date`, `content_license`.
188
+
189
+ ### Fetch questions by ID (batch)
190
+
191
+ Fetch up to 100 questions in one call using semicolons:
192
+
193
+ ```python
194
+ import json
195
+ data = json.loads(http_get(
196
+ "https://api.stackexchange.com/2.3/questions/231767;419163;394809"
197
+ "?site=stackoverflow&filter=withbody"
198
+ ))
199
+ print("Fetched:", len(data['items'])) # 3
200
+ for q in data['items']:
201
+ print(q['question_id'], q['score'], q['title'][:50])
202
+ # 231767 13133 What does the &quot;yield&quot; keyword do in Pyth
203
+ # 419163 8438 What does if __name__ == &quot;__main__&quot;: do?
204
+ # 394809 8125 Does Python have a ternary conditional operator?
205
+ ```
206
+
207
+ ### Search — `search/advanced` vs `search`
208
+
209
+ Use `/search/advanced` when you need combined keyword + tag filtering. Use `/search` when searching only by title keyword (`intitle=`).
210
+
211
+ ```python
212
+ import json
213
+
214
+ # search/advanced: keyword in body OR title, filtered by tag, sorted by relevance
215
+ data = json.loads(http_get(
216
+ "https://api.stackexchange.com/2.3/search/advanced"
217
+ "?q=asyncio+event+loop&tagged=python&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
218
+ ))
219
+ for q in data['items']:
220
+ print(q['score'], q['answer_count'], q['title'][:70])
221
+ # 137 3 "Asyncio Event Loop is Closed" when getting loop
222
+ # 47 3 Can an asyncio event loop run in the background without suspending the
223
+
224
+ # search: title-only keyword search via intitle=
225
+ data = json.loads(http_get(
226
+ "https://api.stackexchange.com/2.3/search"
227
+ "?intitle=asyncio+event+loop&site=stackoverflow&pagesize=5&order=desc&sort=relevance"
228
+ ))
229
+ ```
230
+
231
+ `search/advanced` additional params: `accepted=True` (only questions with accepted answers), `answers=1` (minimum answer count), `body=` (keyword in body), `user=` (filter by owner user ID), `views=` (minimum view count), `fromdate=`/`todate=` (Unix timestamps).
232
+
233
+ ### User profile
234
+
235
+ ```python
236
+ import json
237
+
238
+ # Basic user info
239
+ user = json.loads(http_get("https://api.stackexchange.com/2.3/users/1?site=stackoverflow"))
240
+ u = user['items'][0]
241
+ print("User:", u['display_name'], "Rep:", u['reputation'], "Badges:", u['badge_counts'])
242
+ # User: Jeff Atwood Rep: 64159 Badges: {'bronze': 153, 'silver': 153, 'gold': 48}
243
+
244
+ # Fields: user_id, display_name, reputation, badge_counts, location, link,
245
+ # creation_date, last_access_date, is_employee, account_id,
246
+ # accept_rate, profile_image, website_url
247
+
248
+ # Timeline (badge, question, answer events)
249
+ data = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/timeline?site=stackoverflow&pagesize=5"))
250
+ print("Event types:", set(i['timeline_type'] for i in data['items']))
251
+ # {'badge'}
252
+
253
+ # User's top answers
254
+ answers = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/answers?site=stackoverflow&pagesize=5&order=desc&sort=votes"))
255
+ for a in answers['items']:
256
+ print("Score:", a['score'], "Question ID:", a.get('question_id'))
257
+
258
+ # User's questions
259
+ questions = json.loads(http_get("https://api.stackexchange.com/2.3/users/1/questions?site=stackoverflow&pagesize=3&order=desc&sort=votes"))
260
+ for q in questions['items']:
261
+ print(q['question_id'], q['score'], q['title'][:60])
262
+ # 9 2273 How do I calculate someone&#39;s age based on a DateTime typ
263
+ # 11 1656 Calculate relative time in C#
264
+ ```
265
+
266
+ ### Comments (requires `filter=withbody`)
267
+
268
+ ```python
269
+ import json
270
+ data = json.loads(http_get(
271
+ "https://api.stackexchange.com/2.3/questions/231767/comments"
272
+ "?site=stackoverflow&pagesize=5&order=desc&sort=creation&filter=withbody"
273
+ ))
274
+ for c in data['items']:
275
+ print("Score:", c['score'], "Body:", c.get('body','')[:80])
276
+ # Comment keys (without filter): comment_id, content_license, creation_date,
277
+ # edited, owner, post_id, reply_to_user, score
278
+ # With filter=withbody: adds 'body' field (HTML-encoded)
279
+ ```
280
+
281
+ ### Related questions
282
+
283
+ ```python
284
+ import json
285
+ related = json.loads(http_get(
286
+ "https://api.stackexchange.com/2.3/questions/231767/related?site=stackoverflow&pagesize=5"
287
+ ))
288
+ for q in related['items']:
289
+ print(q['question_id'], q['score'], q['title'][:60])
290
+ # 25232350 15 how generators work in python
291
+ # 28880095 11 What does a plain yield keyword do in Python?
292
+ ```
293
+
294
+ ### Popular tags
295
+
296
+ ```python
297
+ import json
298
+ tags = json.loads(http_get("https://api.stackexchange.com/2.3/tags?order=desc&sort=popular&site=stackoverflow&pagesize=5"))
299
+ for t in tags['items']:
300
+ print(f"{t['name']}: {t['count']:,} questions")
301
+ # javascript: 2,531,995 questions
302
+ # java: 1,921,907 questions
303
+ # c#: 1,626,728 questions
304
+ # python: (check live — grows daily)
305
+ ```
306
+
307
+ ---
308
+
309
+ ## Pagination
310
+
311
+ Use `page=` (1-indexed) and `pagesize=` (max 100). Check `has_more` in the envelope to know whether a next page exists.
312
+
313
+ ```python
314
+ import json
315
+
316
+ def fetch_all_pages(url_base, max_pages=5):
317
+ """Fetch multiple pages from any Stack Exchange API endpoint."""
318
+ results = []
319
+ for page in range(1, max_pages + 1):
320
+ data = json.loads(http_get(f"{url_base}&page={page}"))
321
+ results.extend(data['items'])
322
+ if not data.get('has_more'):
323
+ break
324
+ if data.get('backoff'):
325
+ import time; time.sleep(data['backoff'])
326
+ return results
327
+
328
+ questions = fetch_all_pages(
329
+ "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes"
330
+ "&tagged=python&site=stackoverflow&pagesize=10",
331
+ max_pages=3
332
+ )
333
+ print("Total fetched:", len(questions)) # up to 30
334
+ ```
335
+
336
+ Note: `page=2` with `pagesize=3` returns the 4th–6th items. Confirmed working — `has_more: True` on page 2 of top Python questions.
337
+
338
+ ---
339
+
340
+ ## Parallel fetching (multiple questions or answers)
341
+
342
+ ```python
343
+ import json
344
+ from concurrent.futures import ThreadPoolExecutor
345
+
346
+ def fetch_top_answer(qid):
347
+ data = json.loads(http_get(
348
+ f"https://api.stackexchange.com/2.3/questions/{qid}/answers"
349
+ "?order=desc&sort=votes&site=stackoverflow&filter=withbody&pagesize=1"
350
+ ))
351
+ if data['items']:
352
+ a = data['items'][0]
353
+ return {"qid": qid, "top_score": a['score'], "accepted": a.get('is_accepted')}
354
+ return {"qid": qid, "top_score": 0}
355
+
356
+ qids = [231767, 419163, 394809, 100003, 82831]
357
+ with ThreadPoolExecutor(max_workers=3) as ex:
358
+ results = list(ex.map(fetch_top_answer, qids))
359
+
360
+ for r in results:
361
+ print(r)
362
+ # {'qid': 231767, 'top_score': 18307, 'accepted': True}
363
+ # {'qid': 419163, 'top_score': 9051, 'accepted': True}
364
+ # {'qid': 394809, 'top_score': 9355, 'accepted': True}
365
+ # {'qid': 100003, 'top_score': 9334, 'accepted': False}
366
+ # {'qid': 82831, 'top_score': 6793, 'accepted': False}
367
+ ```
368
+
369
+ Keep `max_workers` at 3 or below when unauthenticated — parallel calls consume quota simultaneously. At 3 workers, 5 questions used 5 quota units (expected).
370
+
371
+ ---
372
+
373
+ ## HTML page scraping (avoid for data tasks)
374
+
375
+ The HTML page works but returns 777KB and has no clean `QAPage` JSON-LD. Use it only when you need something not in the API (e.g. rendered MathJax, ads context).
376
+
377
+ ```python
378
+ import re, json, html as htmllib
379
+ headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
380
+ page = http_get("https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python", headers=headers)
381
+ print("HTML length:", len(page)) # 777138
382
+
383
+ # Page title (includes site suffix)
384
+ title_m = re.search(r'<title>([^<]+)</title>', page)
385
+ if title_m:
386
+ print(htmllib.unescape(title_m.group(1)))
387
+ # 'iterator - What does the "yield" keyword do in Python? - Stack Overflow'
388
+
389
+ # Answer count via itemprop
390
+ ans_count = re.search(r'itemprop="answerCount"[^>]*>(\d+)<', page)
391
+ if ans_count:
392
+ print("Answers:", ans_count.group(1)) # '51'
393
+
394
+ # Score via itemprop (has whitespace around number)
395
+ score_m = re.search(r'itemprop="upvoteCount"[^>]*>\s*(-?\d+)\s*<', page)
396
+ if score_m:
397
+ print("Score:", score_m.group(1)) # '13133'
398
+
399
+ # JSON-LD is present but only has WebSite and Organization — NOT QAPage/Question
400
+ ld_match = re.search(r'<script type="application/ld\+json">(.*?)</script>', page, re.DOTALL)
401
+ if ld_match:
402
+ d = json.loads(ld_match.group(1))
403
+ types = [item.get('@type') for item in d.get('@graph', [])]
404
+ print("JSON-LD types:", types) # ['WebSite', 'Organization'] — no QAPage
405
+ ```
406
+
407
+ ---
408
+
409
+ ## Gotchas
410
+
411
+ - **300 req/day unauthenticated is per IP, resets at midnight UTC.** 6 tests consumed ~27 quota units in one session. With parallel workers and loops, you can burn through 300 in minutes. Always check `quota_remaining` in responses.
412
+
413
+ - **`filter=withbody` is required for body content.** Without it, `body` is simply absent from the response — no error, no empty string, just a missing key. Applies to questions, answers, AND comments.
414
+
415
+ - **Title field has HTML entities, body field has full HTML markup.** They need different decoding strategies: `html.unescape()` for titles, `HTMLParser` stripping for bodies. Don't confuse them.
416
+
417
+ - **Titles in API responses contain `&quot;`, `&lt;`, `&amp;`, `&#39;`** — raw output is `What does the &quot;yield&quot; keyword do in Python?`. Always call `html.unescape()` before displaying or comparing.
418
+
419
+ - **Batch IDs with semicolons, not commas.** `/questions/231767;419163;394809` fetches 3 questions in one API call. Using commas returns a 400 error.
420
+
421
+ - **`search/advanced` searches body text as well as titles; `/search` only searches titles.** Use `search/advanced` with `q=` for full-text search. Use `/search` with `intitle=` for title-only.
422
+
423
+ - **HTTP errors are raised as exceptions, not returned as JSON.** A bad `site=` param causes `urllib.error.HTTPError: HTTP Error 400: Bad Request` — there's no JSON body accessible from `http_get`. Wrap API calls in try/except.
424
+
425
+ - **`backoff` in the response envelope must be respected.** If `data.get('backoff')` returns an integer (rare, typically 10–30 seconds), sleep that duration before the next call. Ignoring it will cause throttle errors on subsequent requests.
426
+
427
+ - **`/info` endpoint wraps stats inside `items[0]`**, not directly in the envelope. Access as `data['items'][0]['total_questions']`.
428
+
429
+ - **JSON-LD on the HTML page is NOT QAPage schema.** The `<script type="application/ld+json">` block only contains `WebSite` and `Organization` objects in the `@graph` array. There is no `Question`, `Answer`, or `QAPage` type — confirmed on the most-voted Python question (231767). Don't rely on structured data from the HTML page.
430
+
431
+ - **User timeline `timeline_type` can be `badge`, `question`, `answer`, `comment`, `revision`, `suggested_edit`, `accepted`.** For very old/inactive users, all recent events may be `badge` only.
432
+
433
+ - **Multi-site support.** Change `site=stackoverflow` to any Stack Exchange site: `site=superuser`, `site=serverfault`, `site=askubuntu`, `site=unix`, `site=datascience`, `site=math`. Same API, same quota pool per IP.
434
+
435
+ - **`pagesize` max is 100.** Requesting more returns a 400 error. For bulk fetching, loop with `page=` and check `has_more`.
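
Several of the gotchas above (semicolon batching, the `pagesize` cap, `backoff`, and entity-encoded titles) can be sketched as a few small helpers. This is a minimal illustration, not part of the package: the helper names `build_questions_url`, `seconds_to_wait`, and `clean_title` are hypothetical, and the block only builds URLs and inspects an already-decoded envelope, so it makes no network calls.

```python
import html
from urllib.parse import urlencode

# Base URL taken from the examples in this guide.
API_ROOT = "https://api.stackexchange.com/2.3"


def build_questions_url(ids, site="stackoverflow", pagesize=100, with_body=False):
    """Build a batched /questions/{ids} URL (hypothetical helper)."""
    if not ids:
        raise ValueError("at least one question ID is required")
    if len(ids) > 100:
        raise ValueError("the guide caps a batch at 100 IDs")
    # IDs are joined with semicolons; the guide reports that commas return a 400.
    id_path = ";".join(str(i) for i in ids)
    # Clamp pagesize to 100, the maximum the guide reports.
    params = {"site": site, "pagesize": min(pagesize, 100)}
    if with_body:
        # Without this filter the 'body' key is absent from each item.
        params["filter"] = "withbody"
    return f"{API_ROOT}/questions/{id_path}?{urlencode(params)}"


def seconds_to_wait(envelope):
    """Return the number of seconds to sleep before the next call (0 if none)."""
    # 'backoff' is usually missing or None; treat both as "no wait".
    return envelope.get("backoff") or 0


def clean_title(item):
    """Decode HTML entities in a title before displaying or comparing it."""
    return html.unescape(item["title"])


url = build_questions_url([231767, 419163, 394809], with_body=True)
print(url)
print(seconds_to_wait({"items": [], "backoff": 10}))
print(clean_title({"title": "What does the &quot;yield&quot; keyword do in Python?"}))
```

In use, the URL would be passed to `http_get` as in the examples above, with `time.sleep(seconds_to_wait(data))` called before the next request.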