@pencil-agent/nano-pencil 2.0.1 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (186)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/grub/README.md +112 -112
  129. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  130. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  131. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  132. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  133. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  134. package/dist/extensions/builtin/loop/README.md +92 -92
  135. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  136. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  137. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  138. package/dist/extensions/builtin/sal/README.md +72 -72
  139. package/dist/extensions/builtin/security-audit/README.md +289 -289
  140. package/dist/extensions/builtin/team/AGENT.md +112 -112
  141. package/dist/extensions/builtin/team/TESTING.md +299 -299
  142. package/dist/extensions/builtin/token-save/README.md +56 -56
  143. package/dist/extensions/optional/AGENT.md +10 -10
  144. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  145. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  146. package/dist/modes/interactive/interactive-mode.js +19 -19
  147. package/dist/modes/interactive/theme/dark.json +85 -85
  148. package/dist/modes/interactive/theme/light.json +84 -84
  149. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  150. package/dist/modes/interactive/theme/warm.json +81 -81
  151. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  152. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  153. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  154. package/docs/SDK-TESTING.md +364 -0
  155. package/docs/codex-goal-command-impl.md +1055 -1055
  156. package/docs/codex-goal-vs-grub.md +500 -500
  157. package/docs/custom-provider.md +27 -27
  158. package/docs/extensions.md +27 -27
  159. package/docs/keybindings.md +27 -27
  160. package/docs/loop 重构完成总结.md +250 -250
  161. package/docs/loop 重构完成报告.md +122 -122
  162. package/docs/loop 重构方案.md +1222 -1222
  163. package/docs/loop 重构方案实现报告.md +158 -158
  164. package/docs/loop 重构方案对比分析.md +128 -128
  165. package/docs/loop 重构计划.md +320 -320
  166. package/docs/loop-usage-examples.md +214 -214
  167. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  168. package/docs/models.md +27 -27
  169. package/docs/packages.md +27 -27
  170. package/docs/pi-design-philosophy.md +457 -457
  171. package/docs/planmode.md +1987 -1987
  172. package/docs/prompt-templates.md +27 -27
  173. package/docs/providers.md +27 -27
  174. package/docs/sdk.md +27 -27
  175. package/docs/skills.md +27 -27
  176. package/docs/startup-performance-optimization.md +301 -0
  177. package/docs/themes.md +27 -27
  178. package/docs/tui.md +27 -27
  179. package/docs/认知地图.md +47 -0
  180. package/package.json +190 -190
  181. package/docs/cc-agent-design.md +0 -1297
  182. package/docs/cc-tui-design.md +0 -1333
  183. package/docs/nanoPencil-学习计划.md +0 -170
  184. package/docs/scan-report.md +0 -3820
  185. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  186. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,436 +1,436 @@
1
- # itch.io — Scraping & Data Extraction
2
-
3
- Field-tested against itch.io on 2026-04-18. All code blocks validated with live requests.
4
-
5
- ---
6
-
7
- ## TL;DR — fastest approaches by task
8
-
9
- | Task | Method | Notes |
10
- |---|---|---|
11
- | Browse listings (36/page) | `http_get` HTML | Works, no key, no bot block |
12
- | Game detail (name, price, rating) | `http_get` + JSON-LD | `<script type="application/ld+json">` Product block |
13
- | Info table (tags, genre, status) | `http_get` + regex on `game_info_panel_widget` | Always present |
14
- | Top N games from any category | RSS `.xml` feed | Cleaner than HTML for bulk |
15
- | API (key endpoints) | `http_get` + key in path | Free keys at itch.io/docs/api |
16
- | Download/purchase counts | Not public | Owners only via dashboard |
17
-
18
- `http_get` works on all itch.io game and browse pages with no extra headers needed.
19
- No Cloudflare, no JS challenge, no CAPTCHA on standard game/browse routes.
20
-
21
- ---
22
-
23
- ## Approach 1 (Fastest for listings): RSS feeds — 36 games per call, clean XML
24
-
25
- Every browse URL has an `.xml` RSS variant. Each item carries price, publish/update dates, platforms, and a thumbnail, so no HTML parsing is needed.
26
-
27
- ```python
28
- import re
29
- from helpers import http_get
30
-
31
- def parse_rss(url):
32
-     """
33
-     Parse any itch.io RSS listing feed.
34
-     url examples:
35
-       https://itch.io/games/top-rated.xml
36
-       https://itch.io/games/newest.xml
37
-       https://itch.io/games/featured.xml
38
-       https://itch.io/games/on-sale.xml
39
-       https://itch.io/games/free.xml
40
-       https://itch.io/games/tag-puzzle.xml # any tag slug works
41
-       https://itch.io/games/top-rated.xml?page=2
42
-     """
43
-     xml = http_get(url)
44
-     items = []
45
-     for m in re.finditer(r'<item>(.*?)</item>', xml, re.DOTALL):
46
-         ix = m.group(1)
47
-         def get(tag, s=ix):
48
-             tm = re.search(rf'<{tag}>(.*?)</{tag}>', s, re.DOTALL)
49
-             return tm.group(1).strip() if tm else None
50
-         items.append({
51
-             'url': get('guid'),
52
-             'title': get('plainTitle'), # clean title, no [tags]
53
-             'price': get('price'), # "$0.00", "$7.99", etc.
54
-             'currency': get('currency'), # "USD"
55
-             'pub_date': get('pubDate'),
56
-             'update_date': get('updateDate'),
57
-             'image': get('imageurl'), # 315x250 thumbnail
58
-             'platforms': {
59
-                 k: get(k) == 'yes'
60
-                 for k in ['windows', 'osx', 'linux', 'android', 'html']
61
-                 if get(k) is not None
62
-             },
63
-         })
64
-     return items
65
-
66
- # Confirmed output:
67
- items = parse_rss("https://itch.io/games/top-rated.xml")
68
- # items[0] -> {
69
- # 'url': 'https://gbpatch.itch.io/our-life',
70
- # 'title': 'Our Life: Beginnings & Always',
71
- # 'price': '$0.00',
72
- # 'currency': 'USD',
73
- # 'pub_date': 'Fri, 07 Jun 2019 23:47:57 GMT',
74
- # 'update_date': 'Sun, 22 May 2022 15:48:27 GMT',
75
- # 'image': 'https://img.itch.zone/aW1nLzcwMTIxNDMucG5n/315x250%23c/BalGQb.png',
76
- # 'platforms': {'windows': True, 'osx': True, 'linux': True, 'android': True},
77
- # }
78
- ```
79
-
80
- **RSS limitations:** no rating score or count. Use HTML scraping (Approach 2) when you need ratings.
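
As an alternative to the regular expressions above, a feed that is well-formed XML could be read with the standard library's `xml.etree.ElementTree`, which also decodes entities such as `&amp;`. This is a minimal sketch: `parse_rss_items` is a hypothetical helper, the element names are the ones used in `parse_rss`, and the sample feed is invented for illustration rather than captured from itch.io.

```python
import xml.etree.ElementTree as ET

# Platform flag elements, as listed in parse_rss above.
PLATFORM_TAGS = ['windows', 'osx', 'linux', 'android', 'html']

def parse_rss_items(xml_text):
    """Parse <item> elements from an RSS string (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter('item'):
        def text(tag, node=item):
            el = node.find(tag)
            return el.text.strip() if el is not None and el.text else None
        items.append({
            'url': text('guid'),
            'title': text('plainTitle'),
            'price': text('price'),
            'platforms': {
                k: text(k) == 'yes'
                for k in PLATFORM_TAGS
                if text(k) is not None
            },
        })
    return items

# Invented sample feed, for illustration only.
SAMPLE = """<rss version="2.0"><channel>
<item><guid>https://example.itch.io/demo</guid>
<plainTitle>Demo &amp; Friends</plainTitle>
<price>$0.00</price><windows>yes</windows><linux>yes</linux></item>
</channel></rss>"""

items = parse_rss_items(SAMPLE)
```

Whether the live feeds always parse as strict XML is not confirmed here, so the regex version remains the tested path.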
81
-
82
- ---
83
-
84
- ## Approach 2: HTML listings — ratings, genre, price, 36 games per page
85
-
86
- ```python
87
- import re
88
- from helpers import http_get
89
-
90
- def parse_game_cards(html):
91
-     """
92
-     Extract all game cards from any itch.io browse/listing/search/profile HTML page.
93
-     Works on:
94
-       https://itch.io/games/top-rated
95
-       https://itch.io/games/newest
96
-       https://itch.io/games/featured
97
-       https://itch.io/games/on-sale
98
-       https://itch.io/games/free
99
-       https://itch.io/games/tag-puzzle (genre/tag path)
100
-       https://itch.io/search?q=platformer (search: 54 cards per page)
101
-       https://<author>.itch.io (author profile)
102
-     All accept ?page=N for pagination (search returns only page 1 via http_get).
103
-     """
104
-     games = []
105
-     for m in re.finditer(r'data-game_id="(\d+)"', html):
106
-         game_id = m.group(1)
107
-         start = m.end()  # search only after this card's id attribute
108
-         chunk = html[start:start + 3000].split('data-game_id="')[0]  # stop at the next card
109
-
110
-         # Title + URL: attribute order differs between page 1 and pages 2+
111
-         title_m = re.search(
112
-             r'class="title game_link"[^>]*href="([^"]+)"[^>]*>([^<]+)</a>', chunk
113
-         )
114
-         if not title_m:
115
-             title_m = re.search(
116
-                 r'href="([^"]+)"[^>]*class="title game_link"[^>]*>([^<]+)</a>', chunk
117
-             )
118
-
119
-         rating_m = re.search(
120
-             r'data-tooltip="([\d.]+) average rating from ([\d,]+) total ratings"', chunk
121
-         )
122
-         genre_m = re.search(r'class="game_genre">([^<]+)</div>', chunk)
123
-         price_m = re.search(r'class="price_value">([^<]+)</div>', chunk)
124
-         desc_m = re.search(r'class="game_text" title="([^"]+)"', chunk)
125
-         img_m = re.search(r'data-lazy_src="([^"]+)"', chunk)
126
-         platforms = re.findall(r'title="Download for ([^"]+)"', chunk)
127
-
128
-         games.append({
129
-             'id': game_id,
130
-             'url': title_m.group(1) if title_m else None,
131
-             'title': title_m.group(2).strip() if title_m else None,
132
-             'rating': float(rating_m.group(1)) if rating_m else None,
133
-             'rating_count': int(rating_m.group(2).replace(',', '')) if rating_m else None,
134
-             'genre': genre_m.group(1) if genre_m else None,
135
-             'price': price_m.group(1) if price_m else 'Free',
136
-             'description': desc_m.group(1) if desc_m else None,
137
-             'thumbnail': img_m.group(1) if img_m else None,
138
-             'platforms': platforms, # ['Windows', 'macOS', 'Linux', 'Android']
139
-         })
140
-     return games
141
-
142
- # Usage:
143
- html = http_get("https://itch.io/games/top-rated")
144
- games = parse_game_cards(html)
145
- # games[0] -> {
146
- # 'id': '434554', 'url': 'https://gbpatch.itch.io/our-life',
147
- # 'title': 'Our Life: Beginnings & Always',
148
- # 'rating': 4.94, 'rating_count': 7191,
149
- # 'genre': 'Visual Novel', 'price': 'Free',
150
- # 'platforms': ['Windows', 'Linux', 'macOS', 'Android'],
151
- # }
152
-
153
- # Paid game example:
154
- html = http_get("https://itch.io/games/top-rated?page=5")
155
- games = parse_game_cards(html)
156
- # Returns games where price_m captures '$7.99' when present
157
- ```
158
-
159
- ### CSS selector reference (for browser/JS use)
160
-
161
- ```
162
- .game_cell — one card per game
163
- .game_cell[data-game_id] — get game ID from attribute
164
- .game_cell .title.game_link — title text + href
165
- .game_cell .game_rating — rating container
166
- .game_cell .game_rating[data-tooltip] — "4.94 average rating from 7,191 total ratings"
167
- .game_cell .star_fill — inline style width: NN% (rating as percentage of 5)
168
- .game_cell .rating_count — "(7,191)"
169
- .game_cell .game_genre — genre text
170
- .game_cell .price_tag .price_value — price e.g. "$7.99" (absent = Free)
171
- .game_cell .game_text — one-line description (also in title attr)
172
- .game_cell .game_author a — author name + href
173
- .game_cell img.lazy_loaded — thumbnail (src in data-lazy_src before JS runs)
174
- ```
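
The selector reference lists two encodings of the same rating: the tooltip text and the `.star_fill` width. As a rough sketch, the helpers below (hypothetical names) parse either one. The width conversion assumes, per the note above, that the percentage is the rating as a fraction of 5; the `98.8%` value is an invented example, not a captured one.

```python
import re

def rating_from_tooltip(tooltip):
    """Parse '4.94 average rating from 7,191 total ratings' -> (4.94, 7191)."""
    m = re.match(r'([\d.]+) average rating from ([\d,]+) total ratings', tooltip)
    if not m:
        return None
    return float(m.group(1)), int(m.group(2).replace(',', ''))

def rating_from_star_width(style):
    """Convert an inline style like 'width: 98.8%' to a 0-5 rating."""
    m = re.search(r'width:\s*([\d.]+)%', style)
    if not m:
        return None
    # Percentage of 5 stars, rounded to the two decimals the tooltip shows.
    return round(float(m.group(1)) / 100 * 5, 2)

tooltip_rating = rating_from_tooltip("4.94 average rating from 7,191 total ratings")
width_rating = rating_from_star_width("width: 98.8%")
```

The width only yields the average, so the tooltip (or `.rating_count`) is still needed for the number of ratings.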
175
-
176
- **Gotcha — attribute order flips on page >= 2.** Page 1 uses `class="..." data-game_id="..."`, page 2+ uses `data-game_id="..." class="..."`. The regex above handles both. If you use a CSS selector engine, `[data-game_id]` is unambiguous.
177
-
178
- **Gotcha — ratings absent on some listing types.** The tag/genre browse pages (e.g. `/games/tag-puzzle`) sometimes omit the rating tooltip on the card even when the game has ratings. Fetch the detail page for the authoritative rating.
179
-
180
- ---
181
-
182
- ## Approach 3: Game detail page — JSON-LD Product schema
183
-
184
- The JSON-LD block on the detail page is the cleanest source for individual game data. All fields below are confirmed:
185
-
186
- ```python
187
- import json, re
188
- from helpers import http_get
189
-
190
- def extract_game_detail(url):
191
-     """
192
-     Fetch full metadata for a single itch.io game.
193
-     url format: https://<author>.itch.io/<game-slug>
194
-     """
195
-     html = http_get(url)
196
-
197
-     # --- JSON-LD (always present, covers name/description/price/rating) ---
198
-     ld_product = None
199
-     for block in re.findall(
200
-         r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
201
-         html, re.DOTALL
202
-     ):
203
-         ld = json.loads(block.strip())
204
-         if ld.get('@type') == 'Product':
205
-             ld_product = ld
206
-             break
207
-
208
-     # --- Info panel table (Status, Platforms, Genre, Tags, Author, etc.) ---
209
-     info = {}
210
-     panel_m = re.search(
211
-         r'class="game_info_panel_widget[^"]*"[^>]*><table>(.*?)</table>',
212
-         html, re.DOTALL
213
-     )
214
-     if panel_m:
215
-         for row in re.finditer(
216
-             r'<tr><td>([^<]+)</td><td>(.*?)</td></tr>',
217
-             panel_m.group(1), re.DOTALL
218
-         ):
219
-             key = row.group(1).strip()
220
-             val = re.sub(r'<[^>]+>', '', row.group(2)).strip()
221
-             # Multi-value fields become lists (Tags, Platforms, Genre, Links)
222
-             info[key] = [v.strip() for v in val.split(',')] if ',' in val else val
223
-
224
-     # --- Cover image ---
225
-     cover_m = re.search(r'<meta property="og:image" content="([^"]+)"', html)
226
-
227
-     offers = (ld_product or {}).get('offers', {})
228
-     agg = (ld_product or {}).get('aggregateRating', {})
229
-
230
-     return {
231
-         'url': url,
232
-         'name': (ld_product or {}).get('name'),
233
-         'description': (ld_product or {}).get('description'),
234
-         'price': offers.get('price'), # "7.99" for paid; None for free (no offers block)
235
-         'currency': offers.get('priceCurrency'), # "USD"; None for free
236
-         'rating': agg.get('ratingValue'), # "4.9" string
237
-         'rating_count': agg.get('ratingCount'), # int
238
-         'cover': cover_m.group(1) if cover_m else None,
239
-         'info': info,
240
-     }
241
-
242
- # Free game:
243
- r = extract_game_detail("https://gbpatch.itch.io/our-life")
244
- # {
245
- # 'name': 'Our Life: Beginnings & Always',
246
- # 'description': 'Grow from childhood to adulthood with the lonely boy next door...',
247
- # 'price': None, 'currency': None, <- no 'offers' block for free games
248
- # 'rating': '4.9', 'rating_count': 7191,
249
- # 'cover': 'https://img.itch.zone/aW1hZ2Uv.../347x500/7HqrvV.jpg',
250
- # 'info': {
251
- # 'Status': 'Released',
252
- # 'Platforms': ['Windows', 'macOS', 'Linux', 'Android'],
253
- # 'Rating': 'Rated 4.9 out of 5 stars(7,191 total ratings)',
254
- # 'Author': 'GBPatch',
255
- # 'Genre': ['Visual Novel', 'Interactive Fiction'],
256
- # 'Tags': ['Amare', 'Comedy', 'Dating Sim', 'Gay', 'LGBT', ...],
257
- # 'Links': 'Steam',
258
- # }
259
- # }
260
-
261
- # Paid game:
262
- r = extract_game_detail("https://adamgryu.itch.io/a-short-hike")
263
- # {
264
- # 'name': 'A Short Hike',
265
- # 'price': '7.99', 'currency': 'USD',
266
- # 'rating': '4.9', 'rating_count': 4307,
267
- # 'info': {
268
- # 'Status': 'Released',
269
- # 'Platforms': ['Windows', 'macOS', 'Linux'],
270
- # 'Release date': 'Jul 30, 2019',
271
- # 'Genre': ['Adventure', 'Platformer'],
272
- # 'Made with': 'Unity',
273
- # 'Tags': ['3D', 'Atmospheric', 'Cute', 'Relaxing', 'Short', ...],
274
- # 'Average session': 'About an hour',
275
- # 'Languages': ['English', 'Spanish; Latin America', 'French', ...],
276
- # 'Inputs': ['Keyboard', 'Mouse', 'Xbox controller', ...],
277
- # 'Accessibility': ['Subtitles', 'Configurable controls'],
278
- # 'Links': ['Steam', 'Homepage', 'Soundtrack', 'Twitter/X'],
279
- # }
280
- # }
281
- ```
282
-
283
- **JSON-LD available fields:**
284
-
285
- | Field | Free game | Paid game |
286
- |---|---|---|
287
- | `@type` | `Product` | `Product` |
288
- | `name` | yes | yes |
289
- | `description` | yes | yes |
290
- | `aggregateRating.ratingValue` | yes | yes |
291
- | `aggregateRating.ratingCount` | yes | yes |
292
- | `offers.price` | absent | yes ("7.99") |
293
- | `offers.priceCurrency` | absent | yes ("USD") |
294
- | `offers.seller.name` | absent | yes (author name) |
295
- | `offers.seller.url` | absent | yes (author profile URL) |
296
-
297
- ---
298
-
299
- ## Pagination
300
-
301
- Browse pages: `?page=N`. Detect the end of results by an HTTP 404 (page number too high) or a missing `<link rel="next">`.
302
-
303
- ```python
304
- import re
305
- from helpers import http_get
306
-
307
- def paginate_listing(base_url, max_pages=10):
308
-     """
309
-     Scrape multiple pages from any itch.io browse URL.
310
-     base_url: https://itch.io/games/top-rated (no ?page= suffix)
311
-     Returns flat list of game dicts.
312
-     Stops when HTTP 404 or no <link rel="next"> found.
313
-     """
314
-     all_games = []
315
-     page = 1
316
-     while page <= max_pages:
317
-         url = base_url if page == 1 else f"{base_url}?page={page}"
318
-         try:
319
-             html = http_get(url)
320
-         except Exception:
321
-             break # 404 = past last page
322
-         all_games.extend(parse_game_cards(html))
323
-         if not re.search(r'<link[^>]+rel="next"[^>]*/>', html):
324
-             break
325
-         page += 1
326
-     return all_games
327
-
328
- # Confirmed: page 1 has <link href="?page=2" rel="next"/>
329
- # page 2 has <link rel="prev" href="/games/top-rated"/> and <link rel="next" href="?page=3"/>
330
- # past last page returns HTTP 404
331
- # top-rated has at least 200 pages (each 36 games); page 300+ -> 404
332
- ```
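
The `rel="next"` regex in the loop above only matches a self-closing `<link ... />` tag. If the markup ever drops the trailing slash, a parser-based check would be more tolerant. The following is a sketch using the standard library's `html.parser`; `find_next_href` is a hypothetical helper, and the first two test strings are the link tags quoted in the comments above.

```python
from html.parser import HTMLParser

class _NextLinkFinder(HTMLParser):
    """Records the href of the first <link rel="next"> it sees."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags via the default handle_startendtag.
        attr_map = dict(attrs)
        if tag == 'link' and attr_map.get('rel') == 'next' and self.next_href is None:
            self.next_href = attr_map.get('href')

def find_next_href(html):
    """Return the next-page href, or None when there is no next page."""
    finder = _NextLinkFinder()
    finder.feed(html)
    return finder.next_href

page1 = find_next_href('<link href="?page=2" rel="next"/>')
page2 = find_next_href('<link rel="prev" href="/games/top-rated"/><link rel="next" href="?page=3"/>')
```

Attribute order and the closing slash no longer matter with this approach, at the cost of parsing the whole page.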
333
-
334
- ---
335
-
336
- ## Browse URL patterns
337
-
338
- All confirmed working via `http_get`:
339
-
340
- ```python
341
- BASE = "https://itch.io/games"
342
-
343
- # Sort orders
344
- f"{BASE}/top-rated" # all-time top rated (rated by community, 0–5 stars)
345
- f"{BASE}/newest" # most recently published
346
- f"{BASE}/featured" # itch.io staff picks
347
- f"{BASE}/on-sale" # discounted games
348
- f"{BASE}/free" # free games only
349
-
350
- # Genre/tag paths (append .xml for RSS)
351
- f"{BASE}/tag-puzzle" # tag slug — prefix with 'tag-'
352
- f"{BASE}/genre-action" # genre — prefix with 'genre-' (less common)
353
-
354
- # Tag + sort cannot be combined in one URL via http_get; fetch each listing separately
355
- # Note: https://itch.io/games/top-rated/tag-puzzle -> HTTP 403
356
- # Note: ?tag= query param does NOT filter server-side (returns same games)
357
-
358
- # Pagination
359
- f"{BASE}/top-rated?page=2"
360
- f"{BASE}/tag-puzzle?page=3"
361
-
362
- # RSS equivalents (36 items, no pagination needed for small sets)
363
- f"{BASE}/top-rated.xml"
364
- f"{BASE}/tag-puzzle.xml"
365
- f"{BASE}/tag-puzzle.xml?page=2"
366
-
367
- # Search (54 results/page, no server-side pagination beyond page 1 via http_get)
368
- "https://itch.io/search?q=platformer"
369
-
370
- # Author profile
371
- "https://<author-slug>.itch.io"
372
- ```
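
The patterns above can be folded into one small builder. This is a sketch only: `browse_url` is a hypothetical helper, and it encodes just the rules stated in this section (a single slug per path, an optional `.xml` suffix, and `?page=N` from page 2 onward).

```python
BASE = "https://itch.io/games"

def browse_url(slug, page=1, rss=False):
    """Build a browse URL for one slug such as 'top-rated' or 'tag-puzzle'."""
    if '/' in slug:
        # Combined paths like top-rated/tag-puzzle are reported above as HTTP 403.
        raise ValueError("sort and tag cannot be combined in one path")
    url = f"{BASE}/{slug}"
    if rss:
        url += ".xml"
    if page > 1:
        url += f"?page={page}"
    return url

listing = browse_url("top-rated")
feed_page_2 = browse_url("tag-puzzle", page=2, rss=True)
```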
373
-
374
- ---
375
-
376
- ## API (requires key)
377
-
378
- itch.io has an official REST API. A free key is issued per account; no rate limit is published.
379
- Get one at: `https://itch.io/user/settings/api-keys`
380
-
381
- Base URL: `https://itch.io/api/1/<key>/`
382
-
383
- ```python
384
- import json
385
- from helpers import http_get
386
-
387
- ITCH_KEY = "your_api_key_here" # from https://itch.io/user/settings/api-keys
388
-
389
- def api(path):
390
-     return json.loads(http_get(f"https://itch.io/api/1/{ITCH_KEY}/{path}"))
391
-
392
- # Authenticated user info
393
- api("me")
394
- # -> {"user": {"id": ..., "username": "...", "url": "...", "display_name": "...", ...}}
395
-
396
- # Games owned by authenticated user
397
- api("my-games")
398
- # -> {"games": [{"id": ..., "title": "...", "url": "...", "created_at": "...",
399
- # "published": true/false, "min_price": 0, ...}, ...]}
400
-
401
- # Download keys for a game (owner only)
402
- api("game/434554/download_keys")
403
-
404
- # Credentials (for authenticated purchases)
405
- api("game/434554/credentials")
406
- ```
407
-
408
- **Error structure:** invalid/missing key returns `{"errors": ["invalid key"]}` with HTTP 200.
409
- Non-existent endpoints return HTTP 404.
410
-
411
- **No unauthenticated game lookup API.** `https://itch.io/api/1/x/games` -> HTTP 404.
412
- Use HTML scraping or RSS for unauthenticated game data.
413
-
414
- ---
415
-
416
- ## Gotchas
417
-
418
- 1. **Attribute order flips page 1 vs 2+.** On page 1, game cards use `class="game_cell ..." data-game_id="..."`. On pages 2+, the order is `data-game_id="..." class="game_cell ..."`. Always match `data-game_id` independently of class ordering.
419
-
420
- 2. **Ratings absent on tag/genre listing pages.** The `data-tooltip` with rating is often missing from card HTML on `/games/tag-*` pages even though the game has ratings. Fetch the detail page for `aggregateRating` via JSON-LD.
421
-
422
- 3. **`price_value` absent = Free.** Paid games have `<div class="price_tag meta_tag" title="Pay $7.99 or more..."><div class="price_value">$7.99</div></div>`. Free games have no such element. Default to `'Free'` when absent.
423
-
424
- 4. **Free-game JSON-LD has no `offers` block.** Only paid games include the `offers` object. For free games, use absence of `offers` as the signal, not presence of `price: 0`.
425
-
426
- 5. **`/games/top-rated/tag-puzzle` returns HTTP 403.** Cannot combine sort + tag in a path. Use separate `/games/tag-puzzle` (top-rated is the default sort anyway).
427
-
428
- 6. **`?tag=` query param is ignored server-side.** `https://itch.io/games/top-rated?tag=puzzle` returns the same games as `/games/top-rated`. Use the `/games/tag-puzzle` path instead.
429
-
430
- 7. **Download/purchase counts are not public.** No count field appears anywhere in the public HTML, JSON-LD, RSS, or unauthenticated API. Game owners see their stats in the dashboard only.
431
-
432
- 8. **Search beyond page 1 is AJAX-only.** `https://itch.io/search?q=X&page=2` via `http_get` returns the same 54 results as page 1. To get more search results use the browser and scroll/click "load more".
433
-
434
- 9. **RSS is capped at 36 items per page.** Paginate with `?page=N`. Very high page numbers (300+) return HTTP 404 on browse pages.
435
-
436
- 10. **Unicode zero-width space in some titles.** `\u200b` (zero-width space) appears at the start of certain titles (e.g. "Our Life: Beginnings & Always"). `.strip()` alone won't remove it; use `title.replace('\u200b', '').strip()`.
1
+ # itch.io — Scraping & Data Extraction
2
+
3
+ Field-tested against itch.io on 2026-04-18. All code blocks validated with live requests.
4
+
5
+ ---
6
+
7
+ ## TL;DR — fastest approaches by task
8
+
9
+ | Task | Method | Notes |
10
+ |---|---|---|
11
+ | Browse listings (36/page) | `http_get` HTML | Works, no key, no bot block |
12
+ | Game detail (name, price, rating) | `http_get` + JSON-LD | `<script type="application/ld+json">` Product block |
13
+ | Info table (tags, genre, status) | `http_get` + regex on `game_info_panel_widget` | Always present |
14
+ | Top N games from any category | RSS `.xml` feed | Cleaner than HTML for bulk |
15
+ | API (key endpoints) | `http_get` + key in path | Free keys at itch.io/user/settings/api-keys |
16
+ | Download/purchase counts | Not public | Owners only via dashboard |
17
+
18
+ `http_get` works on all itch.io game and browse pages with no extra headers needed.
19
+ No Cloudflare, no JS challenge, no CAPTCHA on standard game/browse routes.
20
+
21
+ ---
22
+
23
+ ## Approach 1 (Fastest for listings): RSS feeds — 36 games per call, clean XML
24
+
25
+ Every browse URL has an `.xml` RSS variant. Returns price, pub/update dates, platforms, thumbnail. No HTML parsing.
26
+
27
+ ```python
28
+ import re
29
+ from helpers import http_get
30
+
31
+ def parse_rss(url):
32
+     """
33
+     Parse any itch.io RSS listing feed.
34
+     url examples:
35
+         https://itch.io/games/top-rated.xml
36
+         https://itch.io/games/newest.xml
37
+         https://itch.io/games/featured.xml
38
+         https://itch.io/games/on-sale.xml
39
+         https://itch.io/games/free.xml
40
+         https://itch.io/games/tag-puzzle.xml  # any tag slug works
41
+         https://itch.io/games/top-rated.xml?page=2
42
+     """
43
+     xml = http_get(url)
44
+     items = []
45
+     for m in re.finditer(r'<item>(.*?)</item>', xml, re.DOTALL):
46
+         ix = m.group(1)
47
+         def get(tag, s=ix):
48
+             tm = re.search(rf'<{tag}>(.*?)</{tag}>', s, re.DOTALL)
49
+             return tm.group(1).strip() if tm else None
50
+         items.append({
51
+             'url': get('guid'),
52
+             'title': get('plainTitle'),  # clean title, no [tags]
53
+             'price': get('price'),  # "$0.00", "$7.99", etc.
54
+             'currency': get('currency'),  # "USD"
55
+             'pub_date': get('pubDate'),
56
+             'update_date': get('updateDate'),
57
+             'image': get('imageurl'),  # 315x250 thumbnail
58
+             'platforms': {
59
+                 k: get(k) == 'yes'
60
+                 for k in ['windows', 'osx', 'linux', 'android', 'html']
61
+                 if get(k) is not None
62
+             },
63
+         })
64
+     return items
65
+
66
+ # Confirmed output:
67
+ items = parse_rss("https://itch.io/games/top-rated.xml")
68
+ # items[0] -> {
69
+ # 'url': 'https://gbpatch.itch.io/our-life',
70
+ # 'title': 'Our Life: Beginnings & Always',
71
+ # 'price': '$0.00',
72
+ # 'currency': 'USD',
73
+ # 'pub_date': 'Fri, 07 Jun 2019 23:47:57 GMT',
74
+ # 'update_date': 'Sun, 22 May 2022 15:48:27 GMT',
75
+ # 'image': 'https://img.itch.zone/aW1nLzcwMTIxNDMucG5n/315x250%23c/BalGQb.png',
76
+ # 'platforms': {'windows': True, 'osx': True, 'linux': True, 'android': True},
77
+ # }
78
+ ```
79
+
80
+ **RSS limitations:** no rating score or count. Use HTML scraping (Approach 2) when you need ratings.
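
The feed's `price` field is a display string such as `"$0.00"` or `"$7.99"`. For sorting or filtering by price it can help to convert it to a number. A minimal sketch, assuming the dollar-prefixed format shown in the confirmed output above; `rss_price_to_decimal` is a hypothetical helper, not part of `helpers`:

```python
from decimal import Decimal, InvalidOperation

def rss_price_to_decimal(price):
    """Convert an RSS price string like "$7.99" to Decimal; None if unparseable."""
    if price is None:
        return None
    # Assumption: prices are dollar-prefixed, as in the sample feed output.
    cleaned = price.strip().lstrip('$').replace(',', '')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

With this, a free item is simply one where `rss_price_to_decimal(item['price']) == 0`.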
81
+
82
+ ---
83
+
84
+ ## Approach 2: HTML listings — ratings, genre, price, 36 games per page
85
+
86
+ ```python
87
+ import re
88
+ from helpers import http_get
89
+
90
+ def parse_game_cards(html):
91
+     """
92
+     Extract all game cards from any itch.io browse/listing/search/profile HTML page.
93
+     Works on:
94
+         https://itch.io/games/top-rated
95
+         https://itch.io/games/newest
96
+         https://itch.io/games/featured
97
+         https://itch.io/games/on-sale
98
+         https://itch.io/games/free
99
+         https://itch.io/games/tag-puzzle (genre/tag path)
100
+         https://itch.io/search?q=platformer (search, 54 cards per page)
101
+         https://<author>.itch.io (author profile)
102
+     All accept ?page=N, but search returns the page-1 results for every N via http_get.
103
+     """
104
+     games = []
105
+     for m in re.finditer(r'data-game_id="(\d+)"', html):
106
+         game_id = m.group(1)
107
+         start = m.start()
108
+         chunk = html[start:start + 3000]
109
+
110
+         # Title + URL: attribute order differs between page 1 and pages 2+
111
+         title_m = re.search(
112
+             r'class="title game_link"[^>]*href="([^"]+)"[^>]*>([^<]+)</a>', chunk
113
+         )
114
+         if not title_m:
115
+             title_m = re.search(
116
+                 r'href="([^"]+)"[^>]*class="title game_link"[^>]*>([^<]+)</a>', chunk
117
+             )
118
+
119
+         rating_m = re.search(
120
+             r'data-tooltip="([\d.]+) average rating from ([\d,]+) total ratings"', chunk
121
+         )
122
+         genre_m = re.search(r'class="game_genre">([^<]+)</div>', chunk)
123
+         price_m = re.search(r'class="price_value">([^<]+)</div>', chunk)
124
+         desc_m = re.search(r'class="game_text" title="([^"]+)"', chunk)
125
+         img_m = re.search(r'data-lazy_src="([^"]+)"', chunk)
126
+         platforms = re.findall(r'title="Download for ([^"]+)"', chunk)
127
+
128
+         games.append({
129
+             'id': game_id,
130
+             'url': title_m.group(1) if title_m else None,
131
+             'title': title_m.group(2).strip() if title_m else None,
132
+             'rating': float(rating_m.group(1)) if rating_m else None,
133
+             'rating_count': int(rating_m.group(2).replace(',', '')) if rating_m else None,
134
+             'genre': genre_m.group(1) if genre_m else None,
135
+             'price': price_m.group(1) if price_m else 'Free',
136
+             'description': desc_m.group(1) if desc_m else None,
137
+             'thumbnail': img_m.group(1) if img_m else None,
138
+             'platforms': platforms,  # ['Windows', 'macOS', 'Linux', 'Android']
139
+         })
140
+     return games
141
+
142
+ # Usage:
143
+ html = http_get("https://itch.io/games/top-rated")
144
+ games = parse_game_cards(html)
145
+ # games[0] -> {
146
+ # 'id': '434554', 'url': 'https://gbpatch.itch.io/our-life',
147
+ # 'title': 'Our Life: Beginnings & Always',
148
+ # 'rating': 4.94, 'rating_count': 7191,
149
+ # 'genre': 'Visual Novel', 'price': 'Free',
150
+ # 'platforms': ['Windows', 'Linux', 'macOS', 'Android'],
151
+ # }
152
+
153
+ # Paid game example:
154
+ html = http_get("https://itch.io/games/top-rated?page=5")
155
+ games = parse_game_cards(html)
156
+ # Returns games where price_m captures '$7.99' when present
157
+ ```
158
+
159
+ ### CSS selector reference (for browser/JS use)
160
+
161
+ ```
162
+ .game_cell — one card per game
163
+ .game_cell[data-game_id] — get game ID from attribute
164
+ .game_cell .title.game_link — title text + href
165
+ .game_cell .game_rating — rating container
166
+ .game_cell .game_rating[data-tooltip] — "4.94 average rating from 7,191 total ratings"
167
+ .game_cell .star_fill — inline style width: NN% (rating as percentage of 5)
168
+ .game_cell .rating_count — "(7,191)"
169
+ .game_cell .game_genre — genre text
170
+ .game_cell .price_tag .price_value — price e.g. "$7.99" (absent = Free)
171
+ .game_cell .game_text — one-line description (also in title attr)
172
+ .game_cell .game_author a — author name + href
173
+ .game_cell img.lazy_loaded — thumbnail (src in data-lazy_src before JS runs)
174
+ ```
175
+
176
+ **Gotcha — attribute order flips on page >= 2.** Page 1 uses `class="..." data-game_id="..."`, page 2+ uses `data-game_id="..." class="..."`. The regex above handles both. If you use a CSS selector engine, `[data-game_id]` is unambiguous.
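
To illustrate the attribute-order point, here is a small sketch showing that matching `data-game_id` on its own finds cards in either layout. The two HTML snippets are hypothetical, simplified card openers written for this example, not captured from itch.io, and `game_ids` is an illustrative helper:

```python
import re

# Hypothetical, simplified card openers illustrating the two attribute orders.
PAGE_1_CARD = '<div class="game_cell" data-game_id="434554">'
PAGE_2_CARD = '<div data-game_id="123456" class="game_cell">'

def game_ids(html):
    """Return every game ID, ignoring where data-game_id sits inside the tag."""
    return re.findall(r'data-game_id="(\d+)"', html)
```

A pattern that anchors on `class="game_cell"` followed by `data-game_id` would match only the first snippet, which is why the parser above keys on the attribute alone.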
177
+
178
+ **Gotcha — ratings absent on some listing types.** The tag/genre browse pages (e.g. `/games/tag-puzzle`) sometimes omit the rating tooltip on the card even when the game has ratings. Fetch the detail page for the authoritative rating.
179
+
180
+ ---
181
+
182
+ ## Approach 3: Game detail page — JSON-LD Product schema
183
+
184
+ The cleanest source for individual game data. All confirmed fields:
185
+
186
+ ```python
187
+ import json, re
188
+ from helpers import http_get
189
+
190
+ def extract_game_detail(url):
191
+     """
192
+     Fetch full metadata for a single itch.io game.
193
+     url format: https://<author>.itch.io/<game-slug>
194
+     """
195
+     html = http_get(url)
196
+
197
+     # --- JSON-LD (always present, covers name/description/price/rating) ---
198
+     ld_product = None
199
+     for block in re.findall(
200
+         r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
201
+         html, re.DOTALL
202
+     ):
203
+         ld = json.loads(block.strip())
204
+         if ld.get('@type') == 'Product':
205
+             ld_product = ld
206
+             break
207
+
208
+     # --- Info panel table (Status, Platforms, Genre, Tags, Author, etc.) ---
209
+     info = {}
210
+     panel_m = re.search(
211
+         r'class="game_info_panel_widget[^"]*"[^>]*><table>(.*?)</table>',
212
+         html, re.DOTALL
213
+     )
214
+     if panel_m:
215
+         for row in re.finditer(
216
+             r'<tr><td>([^<]+)</td><td>(.*?)</td></tr>',
217
+             panel_m.group(1), re.DOTALL
218
+         ):
219
+             key = row.group(1).strip()
220
+             val = re.sub(r'<[^>]+>', '', row.group(2)).strip()
221
+             # Multi-value fields become lists (Tags, Platforms, Genre, Links); Rating stays a string despite its "7,191" comma
222
+             info[key] = [v.strip() for v in val.split(',')] if ',' in val and key != 'Rating' else val
223
+
224
+     # --- Cover image ---
225
+     cover_m = re.search(r'<meta property="og:image" content="([^"]+)"', html)
226
+
227
+     offers = (ld_product or {}).get('offers', {})
228
+     agg = (ld_product or {}).get('aggregateRating', {})
229
+
230
+     return {
231
+         'url': url,
232
+         'name': (ld_product or {}).get('name'),
233
+         'description': (ld_product or {}).get('description'),
234
+         'price': offers.get('price'),  # None for free (no 'offers' block), "7.99" for paid
235
+         'currency': offers.get('priceCurrency'),  # "USD" for paid, None for free
236
+         'rating': agg.get('ratingValue'),  # "4.9" string
237
+         'rating_count': agg.get('ratingCount'),  # int
238
+         'cover': cover_m.group(1) if cover_m else None,
239
+         'info': info,
240
+     }
241
+
242
+ # Free game:
243
+ r = extract_game_detail("https://gbpatch.itch.io/our-life")
244
+ # {
245
+ # 'name': 'Our Life: Beginnings & Always',
246
+ # 'description': 'Grow from childhood to adulthood with the lonely boy next door...',
247
+ # 'price': None, 'currency': None, <- no 'offers' block for free games
248
+ # 'rating': '4.9', 'rating_count': 7191,
249
+ # 'cover': 'https://img.itch.zone/aW1hZ2Uv.../347x500/7HqrvV.jpg',
250
+ # 'info': {
251
+ # 'Status': 'Released',
252
+ # 'Platforms': ['Windows', 'macOS', 'Linux', 'Android'],
253
+ # 'Rating': 'Rated 4.9 out of 5 stars(7,191 total ratings)',
254
+ # 'Author': 'GBPatch',
255
+ # 'Genre': ['Visual Novel', 'Interactive Fiction'],
256
+ # 'Tags': ['Amare', 'Comedy', 'Dating Sim', 'Gay', 'LGBT', ...],
257
+ # 'Links': 'Steam',
258
+ # }
259
+ # }
260
+
261
+ # Paid game:
262
+ r = extract_game_detail("https://adamgryu.itch.io/a-short-hike")
263
+ # {
264
+ # 'name': 'A Short Hike',
265
+ # 'price': '7.99', 'currency': 'USD',
266
+ # 'rating': '4.9', 'rating_count': 4307,
267
+ # 'info': {
268
+ # 'Status': 'Released',
269
+ # 'Platforms': ['Windows', 'macOS', 'Linux'],
270
+ # 'Release date': 'Jul 30, 2019',
271
+ # 'Genre': ['Adventure', 'Platformer'],
272
+ # 'Made with': 'Unity',
273
+ # 'Tags': ['3D', 'Atmospheric', 'Cute', 'Relaxing', 'Short', ...],
274
+ # 'Average session': 'About an hour',
275
+ # 'Languages': ['English', 'Spanish; Latin America', 'French', ...],
276
+ # 'Inputs': ['Keyboard', 'Mouse', 'Xbox controller', ...],
277
+ # 'Accessibility': ['Subtitles', 'Configurable controls'],
278
+ # 'Links': ['Steam', 'Homepage', 'Soundtrack', 'Twitter/X'],
279
+ # }
280
+ # }
281
+ ```
282
+
283
+ **JSON-LD available fields:**
284
+
285
+ | Field | Free game | Paid game |
286
+ |---|---|---|
287
+ | `@type` | `Product` | `Product` |
288
+ | `name` | yes | yes |
289
+ | `description` | yes | yes |
290
+ | `aggregateRating.ratingValue` | yes | yes |
291
+ | `aggregateRating.ratingCount` | yes | yes |
292
+ | `offers.price` | absent | yes ("7.99") |
293
+ | `offers.priceCurrency` | absent | yes ("USD") |
294
+ | `offers.seller.name` | absent | yes (author name) |
295
+ | `offers.seller.url` | absent | yes (author profile URL) |
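
Because free games omit `offers` entirely, price handling is easiest when keyed on the absence of that block. A minimal sketch under that assumption; `normalize_price` is a hypothetical helper that takes the already-parsed Product dict (or `None` when no Product block was found):

```python
def normalize_price(ld_product):
    """
    Return (price, currency) from a parsed JSON-LD Product dict.
    A Product without an 'offers' block is treated as free, per the table above.
    If no Product block was found at all, nothing is assumed about the price.
    """
    if ld_product is None:
        return (None, None)
    offers = ld_product.get('offers')
    if not offers:
        return ('Free', None)
    return (offers.get('price'), offers.get('priceCurrency'))
```

Returning `'Free'` matches the default used for listing cards in Approach 2, so both sources produce the same value for free games.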
296
+
297
+ ---
298
+
299
+ ## Pagination
300
+
301
+ Browse pages: `?page=N`. Detect end of results by HTTP 404 (page too high) or absent `<link rel="next">`.
302
+
303
+ ```python
304
+ import re
305
+ from helpers import http_get
306
+
307
+ def paginate_listing(base_url, max_pages=10):
308
+     """
309
+     Scrape multiple pages from any itch.io browse URL.
310
+     base_url: https://itch.io/games/top-rated (no ?page= suffix)
311
+     Returns flat list of game dicts.
312
+     Stops when HTTP 404 or no <link rel="next"> found.
313
+     """
314
+     all_games = []
315
+     page = 1
316
+     while page <= max_pages:
317
+         url = base_url if page == 1 else f"{base_url}?page={page}"
318
+         try:
319
+             html = http_get(url)
320
+         except Exception:
321
+             break  # 404 = past last page
322
+         all_games.extend(parse_game_cards(html))
323
+         if not re.search(r'<link[^>]+rel="next"[^>]*/>', html):
324
+             break
325
+         page += 1
326
+     return all_games
327
+
328
+ # Confirmed: page 1 has <link href="?page=2" rel="next"/>
329
+ # page 2 has <link rel="prev" href="/games/top-rated"/> and <link rel="next" href="?page=3"/>
330
+ # past last page returns HTTP 404
331
+ # top-rated has at least 200 pages (each 36 games); page 300+ -> 404
332
+ ```
333
+
334
+ ---
335
+
336
+ ## Browse URL patterns
337
+
338
+ All confirmed working via `http_get`:
339
+
340
+ ```python
341
+ BASE = "https://itch.io/games"
342
+
343
+ # Sort orders
344
+ f"{BASE}/top-rated" # all-time top rated (rated by community, 0–5 stars)
345
+ f"{BASE}/newest" # most recently published
346
+ f"{BASE}/featured" # itch.io staff picks
347
+ f"{BASE}/on-sale" # discounted games
348
+ f"{BASE}/free" # free games only
349
+
350
+ # Genre/tag paths (append .xml for RSS)
351
+ f"{BASE}/tag-puzzle" # tag slug — prefix with 'tag-'
352
+ f"{BASE}/genre-action" # genre — prefix with 'genre-' (less common)
353
+
354
+ # Combine: tag + sort via separate pages (no combined URL that survives http_get)
355
+ # Note: https://itch.io/games/top-rated/tag-puzzle -> HTTP 403
356
+ # Note: ?tag= query param does NOT filter server-side (returns same games)
357
+
358
+ # Pagination
359
+ f"{BASE}/top-rated?page=2"
360
+ f"{BASE}/tag-puzzle?page=3"
361
+
362
+ # RSS equivalents (36 items, no pagination needed for small sets)
363
+ f"{BASE}/top-rated.xml"
364
+ f"{BASE}/tag-puzzle.xml"
365
+ f"{BASE}/tag-puzzle.xml?page=2"
366
+
367
+ # Search (54 results/page, no server-side pagination beyond page 1 via http_get)
368
+ "https://itch.io/search?q=platformer"
369
+
370
+ # Author profile
371
+ "https://<author-slug>.itch.io"
372
+ ```
373
+
374
+ ---
375
+
376
+ ## API (requires key)
377
+
378
+ itch.io has an official REST API. A free key is issued per-account with no rate limit published.
379
+ Get one at: `https://itch.io/user/settings/api-keys`
380
+
381
+ Base URL: `https://itch.io/api/1/<key>/`
382
+
383
+ ```python
384
+ import json
385
+ from helpers import http_get
386
+
387
+ ITCH_KEY = "your_api_key_here" # from https://itch.io/user/settings/api-keys
388
+
389
+ def api(path):
390
+     return json.loads(http_get(f"https://itch.io/api/1/{ITCH_KEY}/{path}"))
391
+
392
+ # Authenticated user info
393
+ api("me")
394
+ # -> {"user": {"id": ..., "username": "...", "url": "...", "display_name": "...", ...}}
395
+
396
+ # Games owned by authenticated user
397
+ api("my-games")
398
+ # -> {"games": [{"id": ..., "title": "...", "url": "...", "created_at": "...",
399
+ # "published": true/false, "min_price": 0, ...}, ...]}
400
+
401
+ # Download keys for a game (owner only)
402
+ api("game/434554/download_keys")
403
+
404
+ # Credentials (for authenticated purchases)
405
+ api("game/434554/credentials")
406
+ ```
407
+
408
+ **Error structure:** invalid/missing key returns `{"errors": ["invalid key"]}` with HTTP 200.
409
+ Non-existent endpoints return HTTP 404.
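
Since an invalid key comes back with HTTP 200, a status-code check alone will not catch it; the body has to be inspected. A minimal sketch of that check, assuming the `{"errors": [...]}` shape described above. `ItchApiError` and `parse_api_response` are hypothetical names introduced for this example:

```python
import json

class ItchApiError(Exception):
    """Raised when the response body carries an 'errors' list."""

def parse_api_response(body):
    """Decode an API response body and raise if it reports errors."""
    payload = json.loads(body)
    errors = payload.get('errors') if isinstance(payload, dict) else None
    if errors:
        raise ItchApiError('; '.join(str(e) for e in errors))
    return payload
```

The `api()` helper above could pass the result of `http_get` through `parse_api_response` instead of calling `json.loads` directly.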
410
+
411
+ **No unauthenticated game lookup API.** `https://itch.io/api/1/x/games` -> HTTP 404.
412
+ Use HTML scraping or RSS for unauthenticated game data.
413
+
414
+ ---
415
+
416
+ ## Gotchas
417
+
418
+ 1. **Attribute order flips page 1 vs 2+.** On page 1, game cards use `class="game_cell ..." data-game_id="..."`. On pages 2+, the order is `data-game_id="..." class="game_cell ..."`. Always match `data-game_id` independently of class ordering.
419
+
420
+ 2. **Ratings absent on tag/genre listing pages.** The `data-tooltip` with rating is often missing from card HTML on `/games/tag-*` pages even though the game has ratings. Fetch the detail page for `aggregateRating` via JSON-LD.
421
+
422
+ 3. **`price_value` absent = Free.** Paid games have `<div class="price_tag meta_tag" title="Pay $7.99 or more..."><div class="price_value">$7.99</div></div>`. Free games have no such element. Default to `'Free'` when absent.
423
+
424
+ 4. **Free-game JSON-LD has no `offers` block.** Only paid games include the `offers` object. For free games, use absence of `offers` as the signal, not presence of `price: 0`.
425
+
426
+ 5. **`/games/top-rated/tag-puzzle` returns HTTP 403.** Cannot combine sort + tag in a path. Use separate `/games/tag-puzzle` (top-rated is the default sort anyway).
427
+
428
+ 6. **`?tag=` query param is ignored server-side.** `https://itch.io/games/top-rated?tag=puzzle` returns the same games as `/games/top-rated`. Use the `/games/tag-puzzle` path instead.
429
+
430
+ 7. **Download/purchase counts are not public.** No count field appears anywhere in the public HTML, JSON-LD, RSS, or unauthenticated API. Game owners see their stats in the dashboard only.
431
+
432
+ 8. **Search beyond page 1 is AJAX-only.** `https://itch.io/search?q=X&page=2` via `http_get` returns the same 54 results as page 1. To get more search results use the browser and scroll/click "load more".
433
+
434
+ 9. **RSS is capped at 36 items per page.** Paginate with `?page=N`. Very high page numbers (300+) return HTTP 404 on browse pages.
435
+
436
+ 10. **Unicode zero-width space in some titles.** `\u200b` (zero-width space) appears at the start of certain titles (e.g. "Our Life: Beginnings & Always"). `.strip()` alone won't remove it; use `title.replace('\u200b', '').strip()`.
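
The zero-width-space gotcha can be wrapped in a small helper so every parser applies the same cleanup. A minimal sketch; `clean_title` is a hypothetical name, and it only handles U+200B, the one character reported above:

```python
def clean_title(title):
    """Remove zero-width spaces (U+200B), then trim ordinary whitespace."""
    if title is None:
        return None
    return title.replace('\u200b', '').strip()
```

Python does not classify U+200B as whitespace, which is why `.strip()` by itself leaves it in place.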