@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,436 +1,436 @@
- # itch.io — Scraping & Data Extraction
-
- Field-tested against itch.io on 2026-04-18. All code blocks validated with live requests.
-
- ---
-
- ## TL;DR — fastest approaches by task
-
- | Task | Method | Notes |
- |---|---|---|
- | Browse listings (36/page) | `http_get` HTML | Works, no key, no bot block |
- | Game detail (name, price, rating) | `http_get` + JSON-LD | `<script type="application/ld+json">` Product block |
- | Info table (tags, genre, status) | `http_get` + regex on `game_info_panel_widget` | Always present |
- | Top N games from any category | RSS `.xml` feed | Cleaner than HTML for bulk |
- | API (key endpoints) | `http_get` + key in path | Free keys at itch.io/docs/api |
- | Download/purchase counts | Not public | Owners only via dashboard |
-
- `http_get` works on all itch.io game and browse pages with no extra headers needed.
- No Cloudflare, no JS challenge, no CAPTCHA on standard game/browse routes.
-
- ---
-
- ## Approach 1 (Fastest for listings): RSS feeds — 36 games per call, clean XML
-
- Every browse URL has an `.xml` RSS variant. Returns price, pub/update dates, platforms, thumbnail. No HTML parsing.
-
- ```python
- import re
- from helpers import http_get
-
- def parse_rss(url):
-     """
-     Parse any itch.io RSS listing feed.
-     url examples:
-         https://itch.io/games/top-rated.xml
-         https://itch.io/games/newest.xml
-         https://itch.io/games/featured.xml
-         https://itch.io/games/on-sale.xml
-         https://itch.io/games/free.xml
-         https://itch.io/games/tag-puzzle.xml  # any tag slug works
-         https://itch.io/games/top-rated.xml?page=2
-     """
-     xml = http_get(url)
-     items = []
-     for m in re.finditer(r'<item>(.*?)</item>', xml, re.DOTALL):
-         ix = m.group(1)
-         def get(tag, s=ix):
-             tm = re.search(rf'<{tag}>(.*?)</{tag}>', s, re.DOTALL)
-             return tm.group(1).strip() if tm else None
-         items.append({
-             'url': get('guid'),
-             'title': get('plainTitle'),  # clean title, no [tags]
-             'price': get('price'),  # "$0.00", "$7.99", etc.
-             'currency': get('currency'),  # "USD"
-             'pub_date': get('pubDate'),
-             'update_date': get('updateDate'),
-             'image': get('imageurl'),  # 315x250 thumbnail
-             'platforms': {
-                 k: get(k) == 'yes'
-                 for k in ['windows', 'osx', 'linux', 'android', 'html']
-                 if get(k) is not None
-             },
-         })
-     return items
-
- # Confirmed output:
- items = parse_rss("https://itch.io/games/top-rated.xml")
- # items[0] -> {
- #   'url': 'https://gbpatch.itch.io/our-life',
- #   'title': 'Our Life: Beginnings & Always',
- #   'price': '$0.00',
- #   'currency': 'USD',
- #   'pub_date': 'Fri, 07 Jun 2019 23:47:57 GMT',
- #   'update_date': 'Sun, 22 May 2022 15:48:27 GMT',
- #   'image': 'https://img.itch.zone/aW1nLzcwMTIxNDMucG5n/315x250%23c/BalGQb.png',
- #   'platforms': {'windows': True, 'osx': True, 'linux': True, 'android': True},
- # }
- ```
-
- **RSS limitations:** no rating score or count. Use HTML scraping (Approach 2) when you need ratings.
-
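The regex-based `parse_rss` above returns tag contents verbatim, so XML entities or CDATA wrappers in a feed may come through unprocessed. As a hedged alternative, the sketch below reads the same fields with the standard-library `xml.etree.ElementTree`. It assumes the per-item tags named above (`guid`, `plainTitle`, `price`, and the platform flags) are plain, un-namespaced children of `<item>`. `parse_rss_xml` and `SAMPLE` are illustrative names that do not appear in the original file, and the sample feed is made up rather than captured from itch.io.

```python
import xml.etree.ElementTree as ET

PLATFORM_TAGS = ("windows", "osx", "linux", "android", "html")

def parse_rss_xml(xml_text):
    """Parse an RSS feed string into a list of dicts (assumed tag layout)."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        def text(tag, node=item):
            # Return the stripped text of a direct child, or None if absent.
            el = node.find(tag)
            return el.text.strip() if el is not None and el.text else None
        items.append({
            "url": text("guid"),
            "title": text("plainTitle"),
            "price": text("price"),
            "platforms": {
                k: text(k) == "yes" for k in PLATFORM_TAGS if text(k) is not None
            },
        })
    return items

# Hypothetical feed fragment used only to exercise the parser.
SAMPLE = """<rss version="2.0"><channel>
<item>
  <guid>https://example.itch.io/demo-game</guid>
  <plainTitle>Demo &amp; Game</plainTitle>
  <price>$0.00</price>
  <windows>yes</windows>
  <linux>yes</linux>
</item>
</channel></rss>"""

items = parse_rss_xml(SAMPLE)
```

With a real feed, the string returned by `http_get(url)` would take the place of `SAMPLE`. If the live feed turns out to use XML namespaces, the tag lookups would need the namespace prefix added.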
- ---
-
- ## Approach 2: HTML listings — ratings, genre, price, 36 games per page
-
- ```python
- import re
- from helpers import http_get
-
- def parse_game_cards(html):
-     """
-     Extract all game cards from any itch.io browse/listing/search/profile HTML page.
-     Works on:
-         https://itch.io/games/top-rated
-         https://itch.io/games/newest
-         https://itch.io/games/featured
-         https://itch.io/games/on-sale
-         https://itch.io/games/free
-         https://itch.io/games/tag-puzzle (genre/tag path)
-         https://itch.io/search?q=platformer (search — 54 cards per page)
-         https://<author>.itch.io (author profile)
-     All except search accept ?page=N for pagination (search serves only page 1 via http_get).
-     """
-     games = []
-     for m in re.finditer(r'data-game_id="(\d+)"', html):
-         game_id = m.group(1)
-         start = m.start()
-         chunk = html[start:start + 3000]
-
-         # Title + URL — attribute order differs between page 1 and pages 2+
-         title_m = re.search(
-             r'class="title game_link"[^>]*href="([^"]+)"[^>]*>([^<]+)</a>', chunk
-         )
-         if not title_m:
-             title_m = re.search(
-                 r'href="([^"]+)"[^>]*class="title game_link"[^>]*>([^<]+)</a>', chunk
-             )
-
-         rating_m = re.search(
-             r'data-tooltip="([\d.]+) average rating from ([\d,]+) total ratings"', chunk
-         )
-         genre_m = re.search(r'class="game_genre">([^<]+)</div>', chunk)
-         price_m = re.search(r'class="price_value">([^<]+)</div>', chunk)
-         desc_m = re.search(r'class="game_text" title="([^"]+)"', chunk)
-         img_m = re.search(r'data-lazy_src="([^"]+)"', chunk)
-         platforms = re.findall(r'title="Download for ([^"]+)"', chunk)
-
-         games.append({
-             'id': game_id,
-             'url': title_m.group(1) if title_m else None,
-             'title': title_m.group(2).strip() if title_m else None,
-             'rating': float(rating_m.group(1)) if rating_m else None,
-             'rating_count': int(rating_m.group(2).replace(',', '')) if rating_m else None,
-             'genre': genre_m.group(1) if genre_m else None,
-             'price': price_m.group(1) if price_m else 'Free',
-             'description': desc_m.group(1) if desc_m else None,
-             'thumbnail': img_m.group(1) if img_m else None,
-             'platforms': platforms,  # ['Windows', 'macOS', 'Linux', 'Android']
-         })
-     return games
-
- # Usage:
- html = http_get("https://itch.io/games/top-rated")
- games = parse_game_cards(html)
- # games[0] -> {
- #   'id': '434554', 'url': 'https://gbpatch.itch.io/our-life',
- #   'title': 'Our Life: Beginnings & Always',
- #   'rating': 4.94, 'rating_count': 7191,
- #   'genre': 'Visual Novel', 'price': 'Free',
- #   'platforms': ['Windows', 'Linux', 'macOS', 'Android'],
- # }
-
- # Paid game example:
- html = http_get("https://itch.io/games/top-rated?page=5")
- games = parse_game_cards(html)
- # Returns games where price_m captures '$7.99' when present
- ```
-
- ### CSS selector reference (for browser/JS use)
-
- ```
- .game_cell — one card per game
- .game_cell[data-game_id] — get game ID from attribute
- .game_cell .title.game_link — title text + href
- .game_cell .game_rating — rating container
- .game_cell .game_rating[data-tooltip] — "4.94 average rating from 7,191 total ratings"
- .game_cell .star_fill — inline style width: NN% (rating as percentage of 5)
- .game_cell .rating_count — "(7,191)"
- .game_cell .game_genre — genre text
- .game_cell .price_tag .price_value — price e.g. "$7.99" (absent = Free)
- .game_cell .game_text — one-line description (also in title attr)
- .game_cell .game_author a — author name + href
- .game_cell img.lazy_loaded — thumbnail (src in data-lazy_src before JS runs)
- ```
-
- **Gotcha — attribute order flips on page >= 2.** Page 1 uses `class="..." data-game_id="..."`, page 2+ uses `data-game_id="..." class="..."`. The regex above handles both. If you use a CSS selector engine, `[data-game_id]` is unambiguous.
-
- **Gotcha — ratings absent on some listing types.** The tag/genre browse pages (e.g. `/games/tag-puzzle`) sometimes omit the rating tooltip on the card even when the game has ratings. Fetch the detail page for the authoritative rating.
-
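The attribute-order gotcha above stops mattering once the markup goes through a real HTML parser, because attributes arrive as name/value pairs regardless of their order in the source. The sketch below shows this with the standard-library `html.parser`. `GameCellCollector` and the `SAMPLE` markup are hypothetical; the only things taken from the text above are the `game_cell` class and the `data-game_id` attribute.

```python
from html.parser import HTMLParser

class GameCellCollector(HTMLParser):
    """Collect data-game_id values from elements carrying the game_cell class."""

    def __init__(self):
        super().__init__()
        self.game_ids = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attribute order is irrelevant once parsed
        classes = (attr_map.get("class") or "").split()
        if "game_cell" in classes and attr_map.get("data-game_id"):
            self.game_ids.append(attr_map["data-game_id"])

# Hypothetical markup mimicking both attribute orders described above.
SAMPLE = (
    '<div class="game_cell has_cover" data-game_id="101"></div>'
    '<div data-game_id="202" class="game_cell has_cover"></div>'
)

collector = GameCellCollector()
collector.feed(SAMPLE)
game_ids = collector.game_ids
```

This sketch collects IDs only. Pulling titles, ratings, and prices the same way would mean tracking which card the parser is currently inside, which is the bookkeeping the regex version avoids.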
- ---
-
- ## Approach 3: Game detail page — JSON-LD Product schema
-
- The cleanest source for individual game data. All confirmed fields:
-
- ```python
- import json, re
- from helpers import http_get
-
- def extract_game_detail(url):
-     """
-     Fetch full metadata for a single itch.io game.
-     url format: https://<author>.itch.io/<game-slug>
-     """
-     html = http_get(url)
-
-     # --- JSON-LD (always present; covers name/description/rating, plus price for paid games) ---
-     ld_product = None
-     for block in re.findall(
-         r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
-         html, re.DOTALL
-     ):
-         ld = json.loads(block.strip())
-         if ld.get('@type') == 'Product':
-             ld_product = ld
-             break
-
-     # --- Info panel table (Status, Platforms, Genre, Tags, Author, etc.) ---
-     info = {}
-     panel_m = re.search(
-         r'class="game_info_panel_widget[^"]*"[^>]*><table>(.*?)</table>',
-         html, re.DOTALL
-     )
-     if panel_m:
-         for row in re.finditer(
-             r'<tr><td>([^<]+)</td><td>(.*?)</td></tr>',
-             panel_m.group(1), re.DOTALL
-         ):
-             key = row.group(1).strip()
-             val = re.sub(r'<[^>]+>', '', row.group(2)).strip()
-             # Multi-value fields become lists (Tags, Platforms, Genre, Links)
-             info[key] = [v.strip() for v in val.split(',')] if ',' in val else val
-
-     # --- Cover image ---
-     cover_m = re.search(r'<meta property="og:image" content="([^"]+)"', html)
-
-     offers = (ld_product or {}).get('offers', {})
-     agg = (ld_product or {}).get('aggregateRating', {})
-
-     return {
-         'url': url,
-         'name': (ld_product or {}).get('name'),
-         'description': (ld_product or {}).get('description'),
-         'price': offers.get('price'),  # None for free (no 'offers' block), "7.99" for paid
-         'currency': offers.get('priceCurrency'),  # "USD" for paid, None for free
-         'rating': agg.get('ratingValue'),  # "4.9" string
-         'rating_count': agg.get('ratingCount'),  # int
-         'cover': cover_m.group(1) if cover_m else None,
-         'info': info,
-     }
-
- # Free game:
- r = extract_game_detail("https://gbpatch.itch.io/our-life")
- # {
- #   'name': 'Our Life: Beginnings & Always',
- #   'description': 'Grow from childhood to adulthood with the lonely boy next door...',
- #   'price': None, 'currency': None, <- no 'offers' block for free games
- #   'rating': '4.9', 'rating_count': 7191,
- #   'cover': 'https://img.itch.zone/aW1hZ2Uv.../347x500/7HqrvV.jpg',
- #   'info': {
- #     'Status': 'Released',
- #     'Platforms': ['Windows', 'macOS', 'Linux', 'Android'],
- #     'Rating': 'Rated 4.9 out of 5 stars(7,191 total ratings)',
- #     'Author': 'GBPatch',
- #     'Genre': ['Visual Novel', 'Interactive Fiction'],
- #     'Tags': ['Amare', 'Comedy', 'Dating Sim', 'Gay', 'LGBT', ...],
- #     'Links': 'Steam',
- #   }
- # }
-
- # Paid game:
- r = extract_game_detail("https://adamgryu.itch.io/a-short-hike")
- # {
- #   'name': 'A Short Hike',
- #   'price': '7.99', 'currency': 'USD',
- #   'rating': '4.9', 'rating_count': 4307,
- #   'info': {
- #     'Status': 'Released',
- #     'Platforms': ['Windows', 'macOS', 'Linux'],
- #     'Release date': 'Jul 30, 2019',
- #     'Genre': ['Adventure', 'Platformer'],
- #     'Made with': 'Unity',
- #     'Tags': ['3D', 'Atmospheric', 'Cute', 'Relaxing', 'Short', ...],
- #     'Average session': 'About an hour',
- #     'Languages': ['English', 'Spanish; Latin America', 'French', ...],
- #     'Inputs': ['Keyboard', 'Mouse', 'Xbox controller', ...],
- #     'Accessibility': ['Subtitles', 'Configurable controls'],
- #     'Links': ['Steam', 'Homepage', 'Soundtrack', 'Twitter/X'],
- #   }
- # }
- ```
-
- **JSON-LD available fields:**
-
- | Field | Free game | Paid game |
- |---|---|---|
- | `@type` | `Product` | `Product` |
- | `name` | yes | yes |
- | `description` | yes | yes |
- | `aggregateRating.ratingValue` | yes | yes |
- | `aggregateRating.ratingCount` | yes | yes |
- | `offers.price` | absent | yes ("7.99") |
- | `offers.priceCurrency` | absent | yes ("USD") |
- | `offers.seller.name` | absent | yes (author name) |
- | `offers.seller.url` | absent | yes (author profile URL) |
-
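Because the table above shows `offers` missing entirely for free games, downstream code probably wants an explicit free/paid record instead of a bare `None` price. The sketch below is one way to do that, assuming `offers` is a single object with string `price` and `priceCurrency` fields as listed in the table. `find_product`, `price_info`, and the two sample HTML strings are illustrative and not part of the original file. `find_product` also tolerates malformed or list-valued JSON-LD blocks, a defensive assumption that was not observed on itch.io.

```python
import json
import re

LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>', re.DOTALL
)

def find_product(html):
    """Return the first JSON-LD node with @type == 'Product', or None."""
    for block in LD_JSON_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of aborting
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if isinstance(node, dict) and node.get("@type") == "Product":
                return node
    return None

def price_info(product):
    """Map the offers block (or its absence) to an explicit price record."""
    offers = product.get("offers")
    if not offers:
        # No offers block: treated as free, per the field table above.
        return {"is_free": True, "price": 0.0, "currency": None}
    return {
        "is_free": False,
        "price": float(offers["price"]),
        "currency": offers.get("priceCurrency"),
    }

# Hypothetical page fragments, not real itch.io responses.
PAID_HTML = ('<script type="application/ld+json">{"@type": "Product", '
             '"name": "Paid Demo", "offers": {"price": "7.99", '
             '"priceCurrency": "USD"}}</script>')
FREE_HTML = ('<script type="application/ld+json">{"@type": "Product", '
             '"name": "Free Demo"}</script>')

paid = price_info(find_product(PAID_HTML))
free = price_info(find_product(FREE_HTML))
```

Callers should check for a `None` result from `find_product` before calling `price_info`, since a page without a Product block is a different case from a free game.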
- ---
-
- ## Pagination
-
- Browse pages: `?page=N`. Detect end of results by HTTP 404 (page too high) or absent `<link rel="next">`.
-
- ```python
- import re
- from helpers import http_get
-
- def paginate_listing(base_url, max_pages=10):
-     """
-     Scrape multiple pages from any itch.io browse URL.
-     base_url: https://itch.io/games/top-rated (no ?page= suffix)
-     Returns flat list of game dicts.
-     Stops when HTTP 404 or no <link rel="next"> found.
-     """
-     all_games = []
-     page = 1
-     while page <= max_pages:
-         url = base_url if page == 1 else f"{base_url}?page={page}"
-         try:
-             html = http_get(url)
-         except Exception:
-             break  # 404 = past last page
-         all_games.extend(parse_game_cards(html))
-         if not re.search(r'<link[^>]+rel="next"[^>]*/>', html):
-             break
-         page += 1
-     return all_games
-
- # Confirmed: page 1 has <link href="?page=2" rel="next"/>
- #            page 2 has <link rel="prev" href="/games/top-rated"/> and <link rel="next" href="?page=3"/>
- #            past last page returns HTTP 404
- #            top-rated has at least 200 pages (each 36 games); page 300+ -> 404
- ```
-
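The `rel="next"` check above depends on the tag being self-closed with `/>`. As a hedged alternative, the sketch below finds the next-page link with the standard-library `html.parser`, so neither attribute order nor the closing style matters. `NextLinkFinder` and `find_next_href` are hypothetical names. The first two test strings are the `<link>` shapes quoted in the comments above.

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Remember the href of the first <link> whose rel includes 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags are routed here by HTMLParser's default
        # handle_startendtag, so <link .../> and <link ...> both work.
        if tag != "link" or self.next_href is not None:
            return
        attr_map = dict(attrs)
        if "next" in (attr_map.get("rel") or "").split():
            self.next_href = attr_map.get("href")

def find_next_href(html):
    """Return the next-page href, or None when the page has no rel=next link."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_href

page1 = find_next_href('<link href="?page=2" rel="next"/>')
page2 = find_next_href(
    '<link rel="prev" href="/games/top-rated"/><link rel="next" href="?page=3"/>'
)
last = find_next_href('<link rel="prev" href="?page=4">')
```

A loop like `paginate_listing` could then stop when `find_next_href(html)` returns `None`, in place of the regex test.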
- ---
-
- ## Browse URL patterns
-
- All confirmed working via `http_get`:
-
- ```python
- BASE = "https://itch.io/games"
-
- # Sort orders
- f"{BASE}/top-rated"  # all-time top rated (rated by community, 0–5 stars)
- f"{BASE}/newest"  # most recently published
- f"{BASE}/featured"  # itch.io staff picks
- f"{BASE}/on-sale"  # discounted games
- f"{BASE}/free"  # free games only
-
- # Genre/tag paths (append .xml for RSS)
- f"{BASE}/tag-puzzle"  # tag slug — prefix with 'tag-'
- f"{BASE}/genre-action"  # genre — prefix with 'genre-' (less common)
-
- # Combine: tag + sort via separate pages (no combined URL that survives http_get)
- # Note: https://itch.io/games/top-rated/tag-puzzle -> HTTP 403
- # Note: ?tag= query param does NOT filter server-side (returns same games)
-
- # Pagination
- f"{BASE}/top-rated?page=2"
- f"{BASE}/tag-puzzle?page=3"
-
- # RSS equivalents (36 items, no pagination needed for small sets)
- f"{BASE}/top-rated.xml"
- f"{BASE}/tag-puzzle.xml"
- f"{BASE}/tag-puzzle.xml?page=2"
-
- # Search (54 results/page, no server-side pagination beyond page 1 via http_get)
- "https://itch.io/search?q=platformer"
-
- # Author profile
- "https://<author-slug>.itch.io"
- ```
-
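The patterns above reduce to three parts: a slug, an optional `.xml` suffix, and an optional `?page=N` query. A small builder, sketched below, keeps those in one place. `browse_url` is a hypothetical helper that only assembles strings. It does not check that a slug exists, and it deliberately has no way to combine a sort order with a tag, since the notes above report that combination returning HTTP 403.

```python
BASE = "https://itch.io/games"

def browse_url(slug, page=1, rss=False):
    """Build a browse or RSS URL from a slug such as 'top-rated' or 'tag-puzzle'."""
    if page < 1:
        raise ValueError("page must be >= 1")
    url = f"{BASE}/{slug}" + (".xml" if rss else "")
    # Page 1 is the bare URL; later pages add the ?page=N query.
    return url if page == 1 else f"{url}?page={page}"

first_html = browse_url("top-rated")
third_html = browse_url("tag-puzzle", page=3)
second_rss = browse_url("tag-puzzle", page=2, rss=True)
```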
- ---
-
- ## API (requires key)
-
- itch.io has an official REST API. Keys are free and issued per account; no rate limit is published.
- Get one at: `https://itch.io/user/settings/api-keys`
-
- Base URL: `https://itch.io/api/1/<key>/`
-
- ```python
- import json
- from helpers import http_get
-
- ITCH_KEY = "your_api_key_here"  # from https://itch.io/user/settings/api-keys
-
- def api(path):
-     return json.loads(http_get(f"https://itch.io/api/1/{ITCH_KEY}/{path}"))
-
- # Authenticated user info
- api("me")
- # -> {"user": {"id": ..., "username": "...", "url": "...", "display_name": "...", ...}}
-
- # Games owned by authenticated user
- api("my-games")
- # -> {"games": [{"id": ..., "title": "...", "url": "...", "created_at": "...",
- #                "published": true/false, "min_price": 0, ...}, ...]}
-
- # Download keys for a game (owner only)
- api("game/434554/download_keys")
-
- # Credentials (for authenticated purchases)
- api("game/434554/credentials")
- ```
-
- **Error structure:** invalid/missing key returns `{"errors": ["invalid key"]}` with HTTP 200.
- Non-existent endpoints return HTTP 404.
-
- **No unauthenticated game lookup API.** `https://itch.io/api/1/x/games` -> HTTP 404.
- Use HTML scraping or RSS for unauthenticated game data.
-
414
- ---
415
-
416
- ## Gotchas
417
-
418
- 1. **Attribute order flips page 1 vs 2+.** On page 1, game cards use `class="game_cell ..." data-game_id="..."`. On pages 2+, the order is `data-game_id="..." class="game_cell ..."`. Always match `data-game_id` independently of class ordering.
419
-
420
- 2. **Ratings absent on tag/genre listing pages.** The `data-tooltip` with rating is often missing from card HTML on `/games/tag-*` pages even though the game has ratings. Fetch the detail page for `aggregateRating` via JSON-LD.
421
-
422
- 3. **`price_value` absent = Free.** Paid games have `<div class="price_tag meta_tag" title="Pay $7.99 or more..."><div class="price_value">$7.99</div></div>`. Free games have no such element. Default to `'Free'` when absent.
423
-
424
- 4. **Free-game JSON-LD has no `offers` block.** Only paid games include the `offers` object. For free games, use absence of `offers` as the signal, not presence of `price: 0`.
425
-
426
- 5. **`/games/top-rated/tag-puzzle` returns HTTP 403.** Cannot combine sort + tag in a path. Use separate `/games/tag-puzzle` (top-rated is the default sort anyway).
427
-
428
- 6. **`?tag=` query param is ignored server-side.** `https://itch.io/games/top-rated?tag=puzzle` returns the same games as `/games/top-rated`. Use the `/games/tag-puzzle` path instead.
429
-
430
- 7. **Download/purchase counts are not public.** No count field appears anywhere in the public HTML, JSON-LD, RSS, or unauthenticated API. Game owners see their stats in the dashboard only.
431
-
432
- 8. **Search beyond page 1 is AJAX-only.** `https://itch.io/search?q=X&page=2` via `http_get` returns the same 54 results as page 1. To get more search results use the browser and scroll/click "load more".
433
-
434
- 9. **RSS is capped at 36 items per page.** Paginate with `?page=N`. Very high page numbers (300+) return HTTP 404 on browse pages.
435
-
436
- 10. **Unicode zero-width space in some titles.** `\u200b` (zero-width space) appears at the start of certain titles (e.g. "​Our Life: Beginnings & Always"). `.strip()` alone won't remove it; use `title.replace('\u200b', '').strip()`.
1
+ # itch.io — Scraping & Data Extraction
2
+
3
+ Field-tested against itch.io on 2026-04-18. All code blocks validated with live requests.
4
+
5
+ ---
6
+
7
+ ## TL;DR — fastest approaches by task
8
+
9
+ | Task | Method | Notes |
10
+ |---|---|---|
11
+ | Browse listings (36/page) | `http_get` HTML | Works, no key, no bot block |
12
+ | Game detail (name, price, rating) | `http_get` + JSON-LD | `<script type="application/ld+json">` Product block |
13
+ | Info table (tags, genre, status) | `http_get` + regex on `game_info_panel_widget` | Always present |
14
+ | Top N games from any category | RSS `.xml` feed | Cleaner than HTML for bulk |
15
+ | API (key endpoints) | `http_get` + key in path | Free keys at itch.io/user/settings/api-keys |
16
+ | Download/purchase counts | Not public | Owners only via dashboard |
17
+
18
+ `http_get` works on all itch.io game and browse pages with no extra headers needed.
19
+ No Cloudflare, no JS challenge, no CAPTCHA on standard game/browse routes.
20
+
21
+ ---
22
+
23
+ ## Approach 1 (Fastest for listings): RSS feeds — 36 games per call, clean XML
24
+
25
+ Every browse URL has an `.xml` RSS variant. Returns price, pub/update dates, platforms, thumbnail. No HTML parsing.
26
+
27
+ ```python
28
+ import re
29
+ from helpers import http_get
30
+
31
+ def parse_rss(url):
32
+ """
33
+ Parse any itch.io RSS listing feed.
34
+ url examples:
35
+ https://itch.io/games/top-rated.xml
36
+ https://itch.io/games/newest.xml
37
+ https://itch.io/games/featured.xml
38
+ https://itch.io/games/on-sale.xml
39
+ https://itch.io/games/free.xml
40
+ https://itch.io/games/tag-puzzle.xml # any tag slug works
41
+ https://itch.io/games/top-rated.xml?page=2
42
+ """
43
+ xml = http_get(url)
44
+ items = []
45
+ for m in re.finditer(r'<item>(.*?)</item>', xml, re.DOTALL):
46
+ ix = m.group(1)
47
+ def get(tag, s=ix):
48
+ tm = re.search(rf'<{tag}>(.*?)</{tag}>', s, re.DOTALL)
49
+ return tm.group(1).strip() if tm else None
50
+ items.append({
51
+ 'url': get('guid'),
52
+ 'title': get('plainTitle'), # clean title, no [tags]
53
+ 'price': get('price'), # "$0.00", "$7.99", etc.
54
+ 'currency': get('currency'), # "USD"
55
+ 'pub_date': get('pubDate'),
56
+ 'update_date': get('updateDate'),
57
+ 'image': get('imageurl'), # 315x250 thumbnail
58
+ 'platforms': {
59
+ k: get(k) == 'yes'
60
+ for k in ['windows', 'osx', 'linux', 'android', 'html']
61
+ if get(k) is not None
62
+ },
63
+ })
64
+ return items
65
+
66
+ # Confirmed output:
67
+ items = parse_rss("https://itch.io/games/top-rated.xml")
68
+ # items[0] -> {
69
+ # 'url': 'https://gbpatch.itch.io/our-life',
70
+ # 'title': 'Our Life: Beginnings & Always',
71
+ # 'price': '$0.00',
72
+ # 'currency': 'USD',
73
+ # 'pub_date': 'Fri, 07 Jun 2019 23:47:57 GMT',
74
+ # 'update_date': 'Sun, 22 May 2022 15:48:27 GMT',
75
+ # 'image': 'https://img.itch.zone/aW1nLzcwMTIxNDMucG5n/315x250%23c/BalGQb.png',
76
+ # 'platforms': {'windows': True, 'osx': True, 'linux': True, 'android': True},
77
+ # }
78
+ ```
79
+
80
+ **RSS limitations:** no rating score or count. Use HTML scraping (Approach 2) when you need ratings.
81
+
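The RSS fields all come back as strings. The following is a minimal sketch of converting `price` and the two date fields into typed values, assuming the formats shown in the confirmed output above (`$0.00`-style prices, RFC 822 dates). `normalize_rss_item` is a hypothetical helper name, not part of `helpers`.

```python
from email.utils import parsedate_to_datetime

def normalize_rss_item(item):
    """Return a copy of a parse_rss() item with typed price and date fields.

    Assumes prices look like "$7.99" and dates are RFC 822 strings, as in
    the sample output. Values that do not match are stored as None.
    """
    out = dict(item)

    # "$7.99" -> 7.99; anything else -> None
    price = item.get('price')
    out['price_value'] = None
    if price and price.startswith('$'):
        try:
            out['price_value'] = float(price[1:].replace(',', ''))
        except ValueError:
            pass
    out['is_free'] = out['price_value'] == 0.0

    # "Fri, 07 Jun 2019 23:47:57 GMT" -> timezone-aware datetime
    for key in ('pub_date', 'update_date'):
        raw = item.get(key)
        try:
            out[key + '_dt'] = parsedate_to_datetime(raw) if raw else None
        except (TypeError, ValueError):
            out[key + '_dt'] = None
    return out
```

Keeping the original string fields alongside the typed ones makes it easy to spot feeds whose format differs from the sample.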
82
+ ---
83
+
84
+ ## Approach 2: HTML listings — ratings, genre, price, 36 games per page
85
+
86
+ ```python
87
+ import re
88
+ from helpers import http_get
89
+
90
+ def parse_game_cards(html):
91
+ """
92
+ Extract all game cards from any itch.io browse/listing/search/profile HTML page.
93
+ Works on:
94
+ https://itch.io/games/top-rated
95
+ https://itch.io/games/newest
96
+ https://itch.io/games/featured
97
+ https://itch.io/games/on-sale
98
+ https://itch.io/games/free
99
+ https://itch.io/games/tag-puzzle (genre/tag path)
100
+ https://itch.io/search?q=platformer (search — 54 cards per page)
101
+ https://<author>.itch.io (author profile)
102
+ All accept ?page=N for pagination.
103
+ """
104
+ games = []
105
+ for m in re.finditer(r'data-game_id="(\d+)"', html):
106
+ game_id = m.group(1)
107
+ start = m.start()
108
+ chunk = html[start:start + 3000]
109
+
110
+ # Title + URL — attribute order differs between page 1 and pages 2+
111
+ title_m = re.search(
112
+ r'class="title game_link"[^>]*href="([^"]+)"[^>]*>([^<]+)</a>', chunk
113
+ )
114
+ if not title_m:
115
+ title_m = re.search(
116
+ r'href="([^"]+)"[^>]*class="title game_link"[^>]*>([^<]+)</a>', chunk
117
+ )
118
+
119
+ rating_m = re.search(
120
+ r'data-tooltip="([\d.]+) average rating from ([\d,]+) total ratings"', chunk
121
+ )
122
+ genre_m = re.search(r'class="game_genre">([^<]+)</div>', chunk)
123
+ price_m = re.search(r'class="price_value">([^<]+)</div>', chunk)
124
+ desc_m = re.search(r'class="game_text" title="([^"]+)"', chunk)
125
+ img_m = re.search(r'data-lazy_src="([^"]+)"', chunk)
126
+ platforms = re.findall(r'title="Download for ([^"]+)"', chunk)
127
+
128
+ games.append({
129
+ 'id': game_id,
130
+ 'url': title_m.group(1) if title_m else None,
131
+ 'title': title_m.group(2).strip() if title_m else None,
132
+ 'rating': float(rating_m.group(1)) if rating_m else None,
133
+ 'rating_count': int(rating_m.group(2).replace(',', '')) if rating_m else None,
134
+ 'genre': genre_m.group(1) if genre_m else None,
135
+ 'price': price_m.group(1) if price_m else 'Free',
136
+ 'description': desc_m.group(1) if desc_m else None,
137
+ 'thumbnail': img_m.group(1) if img_m else None,
138
+ 'platforms': platforms, # ['Windows', 'macOS', 'Linux', 'Android']
139
+ })
140
+ return games
141
+
142
+ # Usage:
143
+ html = http_get("https://itch.io/games/top-rated")
144
+ games = parse_game_cards(html)
145
+ # games[0] -> {
146
+ # 'id': '434554', 'url': 'https://gbpatch.itch.io/our-life',
147
+ # 'title': 'Our Life: Beginnings & Always',
148
+ # 'rating': 4.94, 'rating_count': 7191,
149
+ # 'genre': 'Visual Novel', 'price': 'Free',
150
+ # 'platforms': ['Windows', 'Linux', 'macOS', 'Android'],
151
+ # }
152
+
153
+ # Paid game example:
154
+ html = http_get("https://itch.io/games/top-rated?page=5")
155
+ games = parse_game_cards(html)
156
+ # Paid cards carry a price string such as '$7.99'; cards without a price_value element fall back to 'Free'
157
+ ```
158
+
159
+ ### CSS selector reference (for browser/JS use)
160
+
161
+ ```
162
+ .game_cell — one card per game
163
+ .game_cell[data-game_id] — get game ID from attribute
164
+ .game_cell .title.game_link — title text + href
165
+ .game_cell .game_rating — rating container
166
+ .game_cell .game_rating[data-tooltip] — "4.94 average rating from 7,191 total ratings"
167
+ .game_cell .star_fill — inline style width: NN% (rating as percentage of 5)
168
+ .game_cell .rating_count — "(7,191)"
169
+ .game_cell .game_genre — genre text
170
+ .game_cell .price_tag .price_value — price e.g. "$7.99" (absent = Free)
171
+ .game_cell .game_text — one-line description (also in title attr)
172
+ .game_cell .game_author a — author name + href
173
+ .game_cell img.lazy_loaded — thumbnail (src in data-lazy_src before JS runs)
174
+ ```
175
+
176
+ **Gotcha — attribute order flips on page >= 2.** Page 1 uses `class="..." data-game_id="..."`, page 2+ uses `data-game_id="..." class="..."`. The regex above handles both. If you use a CSS selector engine, `[data-game_id]` is unambiguous.
177
+
178
+ **Gotcha — ratings absent on some listing types.** The tag/genre browse pages (e.g. `/games/tag-puzzle`) sometimes omit the rating tooltip on the card even when the game has ratings. Fetch the detail page for the authoritative rating.
179
+
180
+ ---
181
+
182
+ ## Approach 3: Game detail page — JSON-LD Product schema
183
+
184
+ The cleanest source for individual game data. All confirmed fields:
185
+
186
+ ```python
187
+ import json, re
188
+ from helpers import http_get
189
+
190
+ def extract_game_detail(url):
191
+ """
192
+ Fetch full metadata for a single itch.io game.
193
+ url format: https://<author>.itch.io/<game-slug>
194
+ """
195
+ html = http_get(url)
196
+
197
+ # --- JSON-LD (always present; covers name/description/rating, plus price for paid games) ---
198
+ ld_product = None
199
+ for block in re.findall(
200
+ r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
201
+ html, re.DOTALL
202
+ ):
203
+ ld = json.loads(block.strip())
204
+ if ld.get('@type') == 'Product':
205
+ ld_product = ld
206
+ break
207
+
208
+ # --- Info panel table (Status, Platforms, Genre, Tags, Author, etc.) ---
209
+ info = {}
210
+ panel_m = re.search(
211
+ r'class="game_info_panel_widget[^"]*"[^>]*><table>(.*?)</table>',
212
+ html, re.DOTALL
213
+ )
214
+ if panel_m:
215
+ for row in re.finditer(
216
+ r'<tr><td>([^<]+)</td><td>(.*?)</td></tr>',
217
+ panel_m.group(1), re.DOTALL
218
+ ):
219
+ key = row.group(1).strip()
220
+ val = re.sub(r'<[^>]+>', '', row.group(2)).strip()
221
+ # Multi-value fields become lists (Tags, Platforms, Genre, Links); Rating stays a string because its count contains a comma
222
+ info[key] = [v.strip() for v in val.split(',')] if ',' in val and key != 'Rating' else val
223
+
224
+ # --- Cover image ---
225
+ cover_m = re.search(r'<meta property="og:image" content="([^"]+)"', html)
226
+
227
+ offers = (ld_product or {}).get('offers', {})
228
+ agg = (ld_product or {}).get('aggregateRating', {})
229
+
230
+ return {
231
+ 'url': url,
232
+ 'name': (ld_product or {}).get('name'),
233
+ 'description': (ld_product or {}).get('description'),
234
+ 'price': offers.get('price'), # None for free (no 'offers' block), "7.99" for paid
235
+ 'currency': offers.get('priceCurrency'), # "USD"
236
+ 'rating': agg.get('ratingValue'), # "4.9" string
237
+ 'rating_count': agg.get('ratingCount'), # int
238
+ 'cover': cover_m.group(1) if cover_m else None,
239
+ 'info': info,
240
+ }
241
+
242
+ # Free game:
243
+ r = extract_game_detail("https://gbpatch.itch.io/our-life")
244
+ # {
245
+ # 'name': 'Our Life: Beginnings & Always',
246
+ # 'description': 'Grow from childhood to adulthood with the lonely boy next door...',
247
+ # 'price': None, 'currency': None, <- no 'offers' block for free games
248
+ # 'rating': '4.9', 'rating_count': 7191,
249
+ # 'cover': 'https://img.itch.zone/aW1hZ2Uv.../347x500/7HqrvV.jpg',
250
+ # 'info': {
251
+ # 'Status': 'Released',
252
+ # 'Platforms': ['Windows', 'macOS', 'Linux', 'Android'],
253
+ # 'Rating': 'Rated 4.9 out of 5 stars(7,191 total ratings)',
254
+ # 'Author': 'GBPatch',
255
+ # 'Genre': ['Visual Novel', 'Interactive Fiction'],
256
+ # 'Tags': ['Amare', 'Comedy', 'Dating Sim', 'Gay', 'LGBT', ...],
257
+ # 'Links': 'Steam',
258
+ # }
259
+ # }
260
+
261
+ # Paid game:
262
+ r = extract_game_detail("https://adamgryu.itch.io/a-short-hike")
263
+ # {
264
+ # 'name': 'A Short Hike',
265
+ # 'price': '7.99', 'currency': 'USD',
266
+ # 'rating': '4.9', 'rating_count': 4307,
267
+ # 'info': {
268
+ # 'Status': 'Released',
269
+ # 'Platforms': ['Windows', 'macOS', 'Linux'],
270
+ # 'Release date': 'Jul 30, 2019',
271
+ # 'Genre': ['Adventure', 'Platformer'],
272
+ # 'Made with': 'Unity',
273
+ # 'Tags': ['3D', 'Atmospheric', 'Cute', 'Relaxing', 'Short', ...],
274
+ # 'Average session': 'About an hour',
275
+ # 'Languages': ['English', 'Spanish; Latin America', 'French', ...],
276
+ # 'Inputs': ['Keyboard', 'Mouse', 'Xbox controller', ...],
277
+ # 'Accessibility': ['Subtitles', 'Configurable controls'],
278
+ # 'Links': ['Steam', 'Homepage', 'Soundtrack', 'Twitter/X'],
279
+ # }
280
+ # }
281
+ ```
282
+
283
+ **JSON-LD available fields:**
284
+
285
+ | Field | Free game | Paid game |
286
+ |---|---|---|
287
+ | `@type` | `Product` | `Product` |
288
+ | `name` | yes | yes |
289
+ | `description` | yes | yes |
290
+ | `aggregateRating.ratingValue` | yes | yes |
291
+ | `aggregateRating.ratingCount` | yes | yes |
292
+ | `offers.price` | absent | yes ("7.99") |
293
+ | `offers.priceCurrency` | absent | yes ("USD") |
294
+ | `offers.seller.name` | absent | yes (author name) |
295
+ | `offers.seller.url` | absent | yes (author profile URL) |
296
+
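Because `offers` is absent for free games, a price lookup has to distinguish "no offers" from "no JSON-LD at all". The sketch below shows one way to do that, assuming the field layout in the table above; `price_from_jsonld` is a hypothetical helper name.

```python
def price_from_jsonld(ld_product):
    """Summarise pricing from a JSON-LD Product dict.

    Returns None when no Product block was found (price unknown),
    otherwise a dict with 'free', 'amount' and 'currency'.
    A Product without an 'offers' object is treated as free.
    """
    if ld_product is None:
        return None
    offers = ld_product.get('offers')
    if not offers:
        return {'free': True, 'amount': 0.0, 'currency': None}
    price = offers.get('price')
    return {
        'free': False,
        'amount': float(price) if price is not None else None,
        'currency': offers.get('priceCurrency'),
    }
```

Returning `None` for a missing Product block keeps a failed parse from being reported as a free game.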
297
+ ---
298
+
299
+ ## Pagination
300
+
301
+ Browse pages: `?page=N`. Detect end of results by HTTP 404 (page too high) or absent `<link rel="next">`.
302
+
303
+ ```python
304
+ import re
305
+ from helpers import http_get
306
+
307
+ def paginate_listing(base_url, max_pages=10):
308
+ """
309
+ Scrape multiple pages from any itch.io browse URL.
310
+ base_url: https://itch.io/games/top-rated (no ?page= suffix)
311
+ Returns flat list of game dicts.
312
+ Stops when HTTP 404 or no <link rel="next"> found.
313
+ """
314
+ all_games = []
315
+ page = 1
316
+ while page <= max_pages:
317
+ url = base_url if page == 1 else f"{base_url}?page={page}"
318
+ try:
319
+ html = http_get(url)
320
+ except Exception:
321
+ break # 404 = past last page
322
+ all_games.extend(parse_game_cards(html))
323
+ if not re.search(r'<link[^>]+rel="next"[^>]*/>', html):
324
+ break
325
+ page += 1
326
+ return all_games
327
+
328
+ # Confirmed: page 1 has <link href="?page=2" rel="next"/>
329
+ # page 2 has <link rel="prev" href="/games/top-rated"/> and <link rel="next" href="?page=3"/>
330
+ # past last page returns HTTP 404
331
+ # top-rated has at least 200 pages (each 36 games); page 300+ -> 404
332
+ ```
333
+
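The `rel="next"` check in `paginate_listing` expects a self-closing `<link ... />` tag. A slightly more tolerant variant, sketched here under the assumption that only the two attribute orders noted above occur, inspects each `<link>` tag separately. `has_next_page` is a hypothetical helper name.

```python
import re

def has_next_page(html):
    """True if any <link> tag carries rel="next", in either attribute order."""
    for tag in re.findall(r'<link\b[^>]*>', html):
        if re.search(r'\brel="next"', tag):
            return True
    return False
```

It could replace the inline `re.search` in the loop without changing the stopping behaviour on the confirmed page layouts.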
334
+ ---
335
+
336
+ ## Browse URL patterns
337
+
338
+ All confirmed working via `http_get`:
339
+
340
+ ```python
341
+ BASE = "https://itch.io/games"
342
+
343
+ # Sort orders
344
+ f"{BASE}/top-rated" # all-time top rated (rated by community, 0–5 stars)
345
+ f"{BASE}/newest" # most recently published
346
+ f"{BASE}/featured" # itch.io staff picks
347
+ f"{BASE}/on-sale" # discounted games
348
+ f"{BASE}/free" # free games only
349
+
350
+ # Genre/tag paths (append .xml for RSS)
351
+ f"{BASE}/tag-puzzle" # tag slug — prefix with 'tag-'
352
+ f"{BASE}/genre-action" # genre — prefix with 'genre-' (less common)
353
+
354
+ # Combine: tag + sort via separate pages (no combined URL that survives http_get)
355
+ # Note: https://itch.io/games/top-rated/tag-puzzle -> HTTP 403
356
+ # Note: ?tag= query param does NOT filter server-side (returns same games)
357
+
358
+ # Pagination
359
+ f"{BASE}/top-rated?page=2"
360
+ f"{BASE}/tag-puzzle?page=3"
361
+
362
+ # RSS equivalents (36 items, no pagination needed for small sets)
363
+ f"{BASE}/top-rated.xml"
364
+ f"{BASE}/tag-puzzle.xml"
365
+ f"{BASE}/tag-puzzle.xml?page=2"
366
+
367
+ # Search (54 results/page, no server-side pagination beyond page 1 via http_get)
368
+ "https://itch.io/search?q=platformer"
369
+
370
+ # Author profile
371
+ "https://<author-slug>.itch.io"
372
+ ```
373
+
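The patterns above can be wrapped in a small builder that refuses the combinations noted as failing. This is a sketch only: `browse_url` and its parameters are hypothetical, and it covers only the sort and tag paths listed here.

```python
BASE = "https://itch.io/games"

def browse_url(sort=None, tag=None, page=1, rss=False):
    """Build a browse or RSS URL from the patterns listed above.

    sort and tag are mutually exclusive because the combined path
    (/games/top-rated/tag-puzzle) returned HTTP 403 in testing.
    """
    if sort and tag:
        raise ValueError("sort and tag cannot be combined in one path")
    if not sort and not tag:
        raise ValueError("pass either sort or tag")
    path = sort if sort else f"tag-{tag}"
    url = f"{BASE}/{path}"
    if rss:
        url += ".xml"          # RSS variant of the same listing
    if page > 1:
        url += f"?page={page}"  # page 1 is the bare URL
    return url

# Example: browse_url(tag="puzzle", page=2, rss=True)
```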
374
+ ---
375
+
376
+ ## API (requires key)
377
+
378
+ itch.io has an official REST API. A free key is issued per-account with no rate limit published.
379
+ Get one at: `https://itch.io/user/settings/api-keys`
380
+
381
+ Base URL: `https://itch.io/api/1/<key>/`
382
+
383
+ ```python
384
+ import json
385
+ from helpers import http_get
386
+
387
+ ITCH_KEY = "your_api_key_here" # from https://itch.io/user/settings/api-keys
388
+
389
+ def api(path):
390
+ return json.loads(http_get(f"https://itch.io/api/1/{ITCH_KEY}/{path}"))
391
+
392
+ # Authenticated user info
393
+ api("me")
394
+ # -> {"user": {"id": ..., "username": "...", "url": "...", "display_name": "...", ...}}
395
+
396
+ # Games owned by authenticated user
397
+ api("my-games")
398
+ # -> {"games": [{"id": ..., "title": "...", "url": "...", "created_at": "...",
399
+ # "published": true/false, "min_price": 0, ...}, ...]}
400
+
401
+ # Download keys for a game (owner only)
402
+ api("game/434554/download_keys")
403
+
404
+ # Credentials (for authenticated purchases)
405
+ api("game/434554/credentials")
406
+ ```
407
+
408
+ **Error structure:** invalid/missing key returns `{"errors": ["invalid key"]}` with HTTP 200.
409
+ Non-existent endpoints return HTTP 404.
410
+
411
+ **No unauthenticated game lookup API.** `https://itch.io/api/1/x/games` -> HTTP 404.
412
+ Use HTML scraping or RSS for unauthenticated game data.
413
+
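Since an invalid key comes back as HTTP 200 with an `errors` list, the status code alone does not signal failure. A minimal sketch of checking the body instead, assuming the error shape described above; `ItchApiError` and `parse_api_response` are hypothetical names.

```python
import json

class ItchApiError(Exception):
    """Raised when an API response body carries a non-empty "errors" list."""

def parse_api_response(body):
    """Decode an API response body and raise if it reports errors."""
    payload = json.loads(body)
    if isinstance(payload, dict) and payload.get('errors'):
        raise ItchApiError('; '.join(str(e) for e in payload['errors']))
    return payload
```

In the `api()` helper above, this would wrap the `http_get(...)` result in place of the bare `json.loads` call.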
414
+ ---
415
+
416
+ ## Gotchas
417
+
418
+ 1. **Attribute order flips page 1 vs 2+.** On page 1, game cards use `class="game_cell ..." data-game_id="..."`. On pages 2+, the order is `data-game_id="..." class="game_cell ..."`. Always match `data-game_id` independently of class ordering.
419
+
420
+ 2. **Ratings absent on tag/genre listing pages.** The `data-tooltip` with rating is often missing from card HTML on `/games/tag-*` pages even though the game has ratings. Fetch the detail page for `aggregateRating` via JSON-LD.
421
+
422
+ 3. **`price_value` absent = Free.** Paid games have `<div class="price_tag meta_tag" title="Pay $7.99 or more..."><div class="price_value">$7.99</div></div>`. Free games have no such element. Default to `'Free'` when absent.
423
+
424
+ 4. **Free-game JSON-LD has no `offers` block.** Only paid games include the `offers` object. For free games, use absence of `offers` as the signal, not presence of `price: 0`.
425
+
426
+ 5. **`/games/top-rated/tag-puzzle` returns HTTP 403.** Cannot combine sort + tag in a path. Use separate `/games/tag-puzzle` (top-rated is the default sort anyway).
427
+
428
+ 6. **`?tag=` query param is ignored server-side.** `https://itch.io/games/top-rated?tag=puzzle` returns the same games as `/games/top-rated`. Use the `/games/tag-puzzle` path instead.
429
+
430
+ 7. **Download/purchase counts are not public.** No count field appears anywhere in the public HTML, JSON-LD, RSS, or unauthenticated API. Game owners see their stats in the dashboard only.
431
+
432
+ 8. **Search beyond page 1 is AJAX-only.** `https://itch.io/search?q=X&page=2` via `http_get` returns the same 54 results as page 1. To get more search results use the browser and scroll/click "load more".
433
+
434
+ 9. **RSS is capped at 36 items per page.** Paginate with `?page=N`. Very high page numbers (300+) return HTTP 404 on browse pages.
435
+
436
+ 10. **Unicode zero-width space in some titles.** `\u200b` (zero-width space) appears at the start of certain titles (e.g. "​Our Life: Beginnings & Always"). `.strip()` alone won't remove it; use `title.replace('\u200b', '').strip()`.
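
The zero-width-space cleanup from gotcha 10 can be applied to every title before storing it. A minimal sketch; `clean_title` is a hypothetical helper name.

```python
def clean_title(title):
    """Drop zero-width spaces (U+200B), then trim ordinary whitespace."""
    return title.replace('\u200b', '').strip()

# str.strip() on its own leaves U+200B in place, because Python does not
# classify it as whitespace.
```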