@pencil-agent/nano-pencil 2.0.1 → 2.0.2
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/model/custom-providers.js +1 -1
- package/dist/core/model-registry.js +5 -5
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +19 -19
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -190
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,436 +1,436 @@
-# itch.io — Scraping & Data Extraction
-
-Field-tested against itch.io on 2026-04-18. All code blocks validated with live requests.
-
----
-
-## TL;DR — fastest approaches by task
-
-| Task | Method | Notes |
-|---|---|---|
-| Browse listings (36/page) | `http_get` HTML | Works, no key, no bot block |
-| Game detail (name, price, rating) | `http_get` + JSON-LD | `<script type="application/ld+json">` Product block |
-| Info table (tags, genre, status) | `http_get` + regex on `game_info_panel_widget` | Always present |
-| Top N games from any category | RSS `.xml` feed | Cleaner than HTML for bulk |
-| API (key endpoints) | `http_get` + key in path | Free keys at itch.io/docs/api |
-| Download/purchase counts | Not public | Owners only via dashboard |
-
-`http_get` works on all itch.io game and browse pages with no extra headers needed.
-No Cloudflare, no JS challenge, no CAPTCHA on standard game/browse routes.
-
----
-
-## Approach 1 (Fastest for listings): RSS feeds — 36 games per call, clean XML
-
-Every browse URL has an `.xml` RSS variant. Returns price, pub/update dates, platforms, thumbnail. No HTML parsing.
-
-```python
-import re
-from helpers import http_get
-
-def parse_rss(url):
-    """
-    Parse any itch.io RSS listing feed.
-    url examples:
-        https://itch.io/games/top-rated.xml
-        https://itch.io/games/newest.xml
-        https://itch.io/games/featured.xml
-        https://itch.io/games/on-sale.xml
-        https://itch.io/games/free.xml
-        https://itch.io/games/tag-puzzle.xml  # any tag slug works
-        https://itch.io/games/top-rated.xml?page=2
-    """
-    xml = http_get(url)
-    items = []
-    for m in re.finditer(r'<item>(.*?)</item>', xml, re.DOTALL):
-        ix = m.group(1)
-        def get(tag, s=ix):
-            tm = re.search(rf'<{tag}>(.*?)</{tag}>', s, re.DOTALL)
-            return tm.group(1).strip() if tm else None
-        items.append({
-            'url': get('guid'),
-            'title': get('plainTitle'),  # clean title, no [tags]
-            'price': get('price'),  # "$0.00", "$7.99", etc.
-            'currency': get('currency'),  # "USD"
-            'pub_date': get('pubDate'),
-            'update_date': get('updateDate'),
-            'image': get('imageurl'),  # 315x250 thumbnail
-            'platforms': {
-                k: get(k) == 'yes'
-                for k in ['windows', 'osx', 'linux', 'android', 'html']
-                if get(k) is not None
-            },
-        })
-    return items
-
-# Confirmed output:
-items = parse_rss("https://itch.io/games/top-rated.xml")
-# items[0] -> {
-#   'url': 'https://gbpatch.itch.io/our-life',
-#   'title': 'Our Life: Beginnings & Always',
-#   'price': '$0.00',
-#   'currency': 'USD',
-#   'pub_date': 'Fri, 07 Jun 2019 23:47:57 GMT',
-#   'update_date': 'Sun, 22 May 2022 15:48:27 GMT',
-#   'image': 'https://img.itch.zone/aW1nLzcwMTIxNDMucG5n/315x250%23c/BalGQb.png',
-#   'platforms': {'windows': True, 'osx': True, 'linux': True, 'android': True},
-# }
-```
-
-**RSS limitations:** no rating score or count. Use HTML scraping (Approach 2) when you need ratings.
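
The feed fields all arrive as strings. The sketch below shows one way to turn a parsed item into typed values. It assumes prices always look like `$7.99` and that dates use the RFC 2822 form shown in the confirmed output. `normalize_rss_item` is a hypothetical helper and is not part of the diffed file.

```python
from email.utils import parsedate_to_datetime

def normalize_rss_item(item):
    """Convert the string fields of one parse_rss() item into typed values.

    Hypothetical helper: assumes prices look like "$7.99" and dates are
    RFC 2822 strings such as "Fri, 07 Jun 2019 23:47:57 GMT".
    """
    out = dict(item)
    price = item.get('price')
    # Drop the leading dollar sign and any thousands separators before converting.
    out['price'] = float(price.lstrip('$').replace(',', '')) if price else None
    for key in ('pub_date', 'update_date'):
        raw = item.get(key)
        out[key] = parsedate_to_datetime(raw) if raw else None
    out['is_free'] = out['price'] == 0.0
    return out

sample = {
    'title': 'Our Life: Beginnings & Always',
    'price': '$0.00',
    'pub_date': 'Fri, 07 Jun 2019 23:47:57 GMT',
    'update_date': 'Sun, 22 May 2022 15:48:27 GMT',
}
normalized = normalize_rss_item(sample)
```

Typed dates make it straightforward to sort or filter a feed by `update_date`. Prices in other currencies would need a different stripping rule.
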
-
----
-
-## Approach 2: HTML listings — ratings, genre, price, 36 games per page
-
-```python
-import re
-from helpers import http_get
-
-def parse_game_cards(html):
-    """
-    Extract all game cards from any itch.io browse/listing/search/profile HTML page.
-    Works on:
-        https://itch.io/games/top-rated
-        https://itch.io/games/newest
-        https://itch.io/games/featured
-        https://itch.io/games/on-sale
-        https://itch.io/games/free
-        https://itch.io/games/tag-puzzle (genre/tag path)
-        https://itch.io/search?q=platformer (search — 54 cards per page)
-        https://<author>.itch.io (author profile)
-    All accept ?page=N for pagination.
-    """
-    games = []
-    for m in re.finditer(r'data-game_id="(\d+)"', html):
-        game_id = m.group(1)
-        start = m.start()
-        chunk = html[start:start + 3000]
-
-        # Title + URL — attribute order differs between page 1 and pages 2+
-        title_m = re.search(
-            r'class="title game_link"[^>]*href="([^"]+)"[^>]*>([^<]+)</a>', chunk
-        )
-        if not title_m:
-            title_m = re.search(
-                r'href="([^"]+)"[^>]*class="title game_link"[^>]*>([^<]+)</a>', chunk
-            )
-
-        rating_m = re.search(
-            r'data-tooltip="([\d.]+) average rating from ([\d,]+) total ratings"', chunk
-        )
-        genre_m = re.search(r'class="game_genre">([^<]+)</div>', chunk)
-        price_m = re.search(r'class="price_value">([^<]+)</div>', chunk)
-        desc_m = re.search(r'class="game_text" title="([^"]+)"', chunk)
-        img_m = re.search(r'data-lazy_src="([^"]+)"', chunk)
-        platforms = re.findall(r'title="Download for ([^"]+)"', chunk)
-
-        games.append({
-            'id': game_id,
-            'url': title_m.group(1) if title_m else None,
-            'title': title_m.group(2).strip() if title_m else None,
-            'rating': float(rating_m.group(1)) if rating_m else None,
-            'rating_count': int(rating_m.group(2).replace(',', '')) if rating_m else None,
-            'genre': genre_m.group(1) if genre_m else None,
-            'price': price_m.group(1) if price_m else 'Free',
-            'description': desc_m.group(1) if desc_m else None,
-            'thumbnail': img_m.group(1) if img_m else None,
-            'platforms': platforms,  # ['Windows', 'macOS', 'Linux', 'Android']
-        })
-    return games
-
-# Usage:
-html = http_get("https://itch.io/games/top-rated")
-games = parse_game_cards(html)
-# games[0] -> {
-#   'id': '434554', 'url': 'https://gbpatch.itch.io/our-life',
-#   'title': 'Our Life: Beginnings & Always',
-#   'rating': 4.94, 'rating_count': 7191,
-#   'genre': 'Visual Novel', 'price': 'Free',
-#   'platforms': ['Windows', 'Linux', 'macOS', 'Android'],
-# }
-
-# Paid game example:
-html = http_get("https://itch.io/games/top-rated?page=5")
-games = parse_game_cards(html)
-# Returns games where price_m captures '$7.99' when present
-```
-
-### CSS selector reference (for browser/JS use)
-
-```
-.game_cell                              — one card per game
-.game_cell[data-game_id]                — get game ID from attribute
-.game_cell .title.game_link             — title text + href
-.game_cell .game_rating                 — rating container
-.game_cell .game_rating[data-tooltip]   — "4.94 average rating from 7,191 total ratings"
-.game_cell .star_fill                   — inline style width: NN% (rating as percentage of 5)
-.game_cell .rating_count                — "(7,191)"
-.game_cell .game_genre                  — genre text
-.game_cell .price_tag .price_value      — price e.g. "$7.99" (absent = Free)
-.game_cell .game_text                   — one-line description (also in title attr)
-.game_cell .game_author a               — author name + href
-.game_cell img.lazy_loaded              — thumbnail (src in data-lazy_src before JS runs)
-```
-
-**Gotcha — attribute order flips on page >= 2.** Page 1 uses `class="..." data-game_id="..."`, page 2+ uses `data-game_id="..." class="..."`. The regex above handles both. If you use a CSS selector engine, `[data-game_id]` is unambiguous.
-
-**Gotcha — ratings absent on some listing types.** The tag/genre browse pages (e.g. `/games/tag-puzzle`) sometimes omit the rating tooltip on the card even when the game has ratings. Fetch the detail page for the authoritative rating.
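
The selector reference lists two places where a card exposes its rating: the tooltip text and the `star_fill` width. The sketch below reads either one, so a caller can fall back to the width when the tooltip is missing. Both function names are hypothetical, and the `width: NN%` style format is assumed from the selector note rather than confirmed against live markup.

```python
import re

def rating_from_tooltip(tooltip):
    """Parse '4.94 average rating from 7,191 total ratings' into (4.94, 7191)."""
    m = re.search(r'([\d.]+) average rating from ([\d,]+) total ratings', tooltip or '')
    if not m:
        return None
    return float(m.group(1)), int(m.group(2).replace(',', ''))

def rating_from_star_width(style):
    """Convert a star_fill inline style such as 'width: 50%' to a 0-5 score.

    Assumes the width is the rating expressed as a percentage of 5 stars.
    """
    m = re.search(r'width:\s*([\d.]+)%', style or '')
    return round(float(m.group(1)) / 100 * 5, 2) if m else None

parsed = rating_from_tooltip("4.94 average rating from 7,191 total ratings")
half = rating_from_star_width("width: 50%")
```

The width only yields a score, not a rating count, so the detail page remains the better source when both numbers matter.
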
-
----
-
-## Approach 3: Game detail page — JSON-LD Product schema
-
-The cleanest source for individual game data. All confirmed fields:
-
-```python
-import json, re
-from helpers import http_get
-
-def extract_game_detail(url):
-    """
-    Fetch full metadata for a single itch.io game.
-    url format: https://<author>.itch.io/<game-slug>
-    """
-    html = http_get(url)
-
-    # --- JSON-LD (always present, covers name/description/price/rating) ---
-    ld_product = None
-    for block in re.findall(
-        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
-        html, re.DOTALL
-    ):
-        ld = json.loads(block.strip())
-        if ld.get('@type') == 'Product':
-            ld_product = ld
-            break
-
-    # --- Info panel table (Status, Platforms, Genre, Tags, Author, etc.) ---
-    info = {}
-    panel_m = re.search(
-        r'class="game_info_panel_widget[^"]*"[^>]*><table>(.*?)</table>',
-        html, re.DOTALL
-    )
-    if panel_m:
-        for row in re.finditer(
-            r'<tr><td>([^<]+)</td><td>(.*?)</td></tr>',
-            panel_m.group(1), re.DOTALL
-        ):
-            key = row.group(1).strip()
-            val = re.sub(r'<[^>]+>', '', row.group(2)).strip()
-            # Multi-value fields become lists (Tags, Platforms, Genre, Links)
-            info[key] = [v.strip() for v in val.split(',')] if ',' in val else val
-
-    # --- Cover image ---
-    cover_m = re.search(r'<meta property="og:image" content="([^"]+)"', html)
-
-    offers = (ld_product or {}).get('offers', {})
-    agg = (ld_product or {}).get('aggregateRating', {})
-
-    return {
-        'url': url,
-        'name': (ld_product or {}).get('name'),
-        'description': (ld_product or {}).get('description'),
-        'price': offers.get('price'),  # "0.00" for free, "7.99" for paid
-        'currency': offers.get('priceCurrency'),  # "USD"
-        'rating': agg.get('ratingValue'),  # "4.9" string
-        'rating_count': agg.get('ratingCount'),  # int
-        'cover': cover_m.group(1) if cover_m else None,
-        'info': info,
-    }
-
-# Free game:
-r = extract_game_detail("https://gbpatch.itch.io/our-life")
-# {
-#   'name': 'Our Life: Beginnings & Always',
-#   'description': 'Grow from childhood to adulthood with the lonely boy next door...',
-#   'price': None, 'currency': None,  <- no 'offers' block for free games
-#   'rating': '4.9', 'rating_count': 7191,
-#   'cover': 'https://img.itch.zone/aW1hZ2Uv.../347x500/7HqrvV.jpg',
-#   'info': {
-#     'Status': 'Released',
-#     'Platforms': ['Windows', 'macOS', 'Linux', 'Android'],
-#     'Rating': 'Rated 4.9 out of 5 stars(7,191 total ratings)',
-#     'Author': 'GBPatch',
-#     'Genre': ['Visual Novel', 'Interactive Fiction'],
-#     'Tags': ['Amare', 'Comedy', 'Dating Sim', 'Gay', 'LGBT', ...],
-#     'Links': 'Steam',
-#   }
-# }
-
-# Paid game:
-r = extract_game_detail("https://adamgryu.itch.io/a-short-hike")
-# {
-#   'name': 'A Short Hike',
-#   'price': '7.99', 'currency': 'USD',
-#   'rating': '4.9', 'rating_count': 4307,
-#   'info': {
-#     'Status': 'Released',
-#     'Platforms': ['Windows', 'macOS', 'Linux'],
-#     'Release date': 'Jul 30, 2019',
-#     'Genre': ['Adventure', 'Platformer'],
-#     'Made with': 'Unity',
-#     'Tags': ['3D', 'Atmospheric', 'Cute', 'Relaxing', 'Short', ...],
-#     'Average session': 'About an hour',
-#     'Languages': ['English', 'Spanish; Latin America', 'French', ...],
-#     'Inputs': ['Keyboard', 'Mouse', 'Xbox controller', ...],
-#     'Accessibility': ['Subtitles', 'Configurable controls'],
-#     'Links': ['Steam', 'Homepage', 'Soundtrack', 'Twitter/X'],
-#   }
-# }
-```
-
-**JSON-LD available fields:**
-
-| Field | Free game | Paid game |
-|---|---|---|
-| `@type` | `Product` | `Product` |
-| `name` | yes | yes |
-| `description` | yes | yes |
-| `aggregateRating.ratingValue` | yes | yes |
-| `aggregateRating.ratingCount` | yes | yes |
-| `offers.price` | absent | yes ("7.99") |
-| `offers.priceCurrency` | absent | yes ("USD") |
-| `offers.seller.name` | absent | yes (author name) |
-| `offers.seller.url` | absent | yes (author profile URL) |
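
Because free games omit the `offers` block, downstream code has to treat a missing price differently from a price of zero. The sketch below flattens a JSON-LD Product dict under that assumption. `normalize_product` is a hypothetical name, and mapping a missing `offers` block to a price of `0.0` is an interpretation of the field table, not something the page itself states.

```python
def normalize_product(ld_product):
    """Flatten a JSON-LD Product dict into typed fields.

    Assumption taken from the field table: free games have no 'offers'
    block, so a missing price is reported as 0.0 with is_free=True.
    """
    ld = ld_product or {}
    offers = ld.get('offers') or {}
    agg = ld.get('aggregateRating') or {}
    price = offers.get('price')
    rating = agg.get('ratingValue')
    return {
        'name': ld.get('name'),
        'price': float(price) if price is not None else 0.0,
        'currency': offers.get('priceCurrency'),
        'is_free': price is None or float(price) == 0.0,
        # ratingValue arrives as a string such as "4.9"
        'rating': float(rating) if rating is not None else None,
        'rating_count': agg.get('ratingCount'),
    }

paid = normalize_product({
    '@type': 'Product',
    'name': 'A Short Hike',
    'offers': {'price': '7.99', 'priceCurrency': 'USD'},
    'aggregateRating': {'ratingValue': '4.9', 'ratingCount': 4307},
})
free = normalize_product({
    '@type': 'Product',
    'name': 'Our Life: Beginnings & Always',
    'aggregateRating': {'ratingValue': '4.9', 'ratingCount': 7191},
})
```

If the Product block is missing altogether, this helper would also report the game as free, so callers may want to check for a non-empty `name` first.
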
-
----
-
-## Pagination
-
-Browse pages: `?page=N`. Detect end of results by HTTP 404 (page too high) or absent `<link rel="next">`.
-
-```python
-import re
-from helpers import http_get
-
-def paginate_listing(base_url, max_pages=10):
-    """
-    Scrape multiple pages from any itch.io browse URL.
-    base_url: https://itch.io/games/top-rated (no ?page= suffix)
-    Returns flat list of game dicts.
-    Stops when HTTP 404 or no <link rel="next"> found.
-    """
-    all_games = []
-    page = 1
-    while page <= max_pages:
-        url = base_url if page == 1 else f"{base_url}?page={page}"
-        try:
-            html = http_get(url)
-        except Exception:
-            break  # 404 = past last page
-        all_games.extend(parse_game_cards(html))
-        if not re.search(r'<link[^>]+rel="next"[^>]*/>', html):
-            break
-        page += 1
-    return all_games
-
-# Confirmed: page 1 has <link href="?page=2" rel="next"/>
-#            page 2 has <link rel="prev" href="/games/top-rated"/> and <link rel="next" href="?page=3"/>
-#            past last page returns HTTP 404
-#            top-rated has at least 200 pages (each 36 games); page 300+ -> 404
-```
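
The two confirmed `<link>` examples place `href` on different sides of `rel`. The sketch below is an order-independent way to read the next-page link: it inspects each `<link>` tag on its own and returns the `href`. `next_page_href` is a hypothetical helper, shown only to illustrate the end-of-results check.

```python
import re

def next_page_href(html):
    """Return the href of the <link rel="next"> tag, or None on the last page.

    Each <link> tag is examined separately, so the result does not depend
    on whether href appears before or after rel.
    """
    for tag in re.findall(r'<link\b[^>]*>', html):
        if 'rel="next"' in tag:
            m = re.search(r'href="([^"]*)"', tag)
            return m.group(1) if m else None
    return None

page1 = '<head><link href="?page=2" rel="next"/></head>'
page2 = '<link rel="prev" href="/games/top-rated"/><link rel="next" href="?page=3"/>'
last = '<link rel="prev" href="?page=2"/>'
```

Returning the `href` rather than a boolean lets a caller follow the site's own link instead of building the next URL itself.
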
-
----
-
-## Browse URL patterns
-
-All confirmed working via `http_get`:
-
-```python
-BASE = "https://itch.io/games"
-
-# Sort orders
-f"{BASE}/top-rated"     # all-time top rated (rated by community, 0–5 stars)
-f"{BASE}/newest"        # most recently published
-f"{BASE}/featured"      # itch.io staff picks
-f"{BASE}/on-sale"       # discounted games
-f"{BASE}/free"          # free games only
-
-# Genre/tag paths (append .xml for RSS)
-f"{BASE}/tag-puzzle"    # tag slug — prefix with 'tag-'
-f"{BASE}/genre-action"  # genre — prefix with 'genre-' (less common)
-
-# Combine: tag + sort via separate pages (no combined URL that survives http_get)
-# Note: https://itch.io/games/top-rated/tag-puzzle -> HTTP 403
-# Note: ?tag= query param does NOT filter server-side (returns same games)
-
-# Pagination
-f"{BASE}/top-rated?page=2"
-f"{BASE}/tag-puzzle?page=3"
-
-# RSS equivalents (36 items, no pagination needed for small sets)
-f"{BASE}/top-rated.xml"
-f"{BASE}/tag-puzzle.xml"
-f"{BASE}/tag-puzzle.xml?page=2"
-
-# Search (54 results/page, no server-side pagination beyond page 1 via http_get)
-"https://itch.io/search?q=platformer"
-
-# Author profile
-"https://<author-slug>.itch.io"
-```
|
|
373
|
-
|
|
374
|
-
---
|
|
375
|
-
|
|
376
|
-
## API (requires key)
|
|
377
|
-
|
|
378
|
-
itch.io has an official REST API. A free key is issued per-account with no rate limit published.
|
|
379
|
-
Get one at: `https://itch.io/user/settings/api-keys`
|
|
380
|
-
|
|
381
|
-
Base URL: `https://itch.io/api/1/<key>/`
|
|
382
|
-
|
|
383
|
-
```python
|
|
384
|
-
import json
|
|
385
|
-
from helpers import http_get
|
|
386
|
-
|
|
387
|
-
ITCH_KEY = "your_api_key_here" # from https://itch.io/user/settings/api-keys
|
|
388
|
-
|
|
389
|
-
def api(path):
|
|
390
|
-
return json.loads(http_get(f"https://itch.io/api/1/{ITCH_KEY}/{path}"))
|
|
391
|
-
|
|
392
|
-
# Authenticated user info
|
|
393
|
-
api("me")
|
|
394
|
-
# -> {"user": {"id": ..., "username": "...", "url": "...", "display_name": "...", ...}}
|
|
395
|
-
|
|
396
|
-
# Games owned by authenticated user
|
|
397
|
-
api("my-games")
|
|
398
|
-
# -> {"games": [{"id": ..., "title": "...", "url": "...", "created_at": "...",
|
|
399
|
-
# "published": true/false, "min_price": 0, ...}, ...]}
|
|
400
|
-
|
|
401
|
-
# Download keys for a game (owner only)
|
|
402
|
-
api("game/434554/download_keys")
|
|
403
|
-
|
|
404
|
-
# Credentials (for authenticated purchases)
|
|
405
|
-
api("game/434554/credentials")
|
|
406
|
-
```
|
|
407
|
-
|
|
408
|
-
**Error structure:** invalid/missing key returns `{"errors": ["invalid key"]}` with HTTP 200.
|
|
409
|
-
Non-existent endpoints return HTTP 404.
|
|
410
|
-
|
|
411
|
-
**No unauthenticated game lookup API.** `https://itch.io/api/1/x/games` -> HTTP 404.
|
|
412
|
-
Use HTML scraping or RSS for unauthenticated game data.
|
|
413
|
-
|
|
414
|
-
---
|
|
415
|
-
|
|
416
|
-
## Gotchas
|
|
417
|
-
|
|
418
|
-
1. **Attribute order flips page 1 vs 2+.** On page 1, game cards use `class="game_cell ..." data-game_id="..."`. On pages 2+, the order is `data-game_id="..." class="game_cell ..."`. Always match `data-game_id` independently of class ordering.
|
|
419
|
-
|
|
420
|
-
2. **Ratings absent on tag/genre listing pages.** The `data-tooltip` with rating is often missing from card HTML on `/games/tag-*` pages even though the game has ratings. Fetch the detail page for `aggregateRating` via JSON-LD.
|
|
421
|
-
|
|
422
|
-
3. **`price_value` absent = Free.** Paid games have `<div class="price_tag meta_tag" title="Pay $7.99 or more..."><div class="price_value">$7.99</div></div>`. Free games have no such element. Default to `'Free'` when absent.
|
|
423
|
-
|
|
424
|
-
4. **Free-game JSON-LD has no `offers` block.** Only paid games include the `offers` object. For free games, use absence of `offers` as the signal, not presence of `price: 0`.
|
|
425
|
-
|
|
426
|
-
5. **`/games/top-rated/tag-puzzle` returns HTTP 403.** Cannot combine sort + tag in a path. Use separate `/games/tag-puzzle` (top-rated is the default sort anyway).
|
|
427
|
-
|
|
428
|
-
6. **`?tag=` query param is ignored server-side.** `https://itch.io/games/top-rated?tag=puzzle` returns the same games as `/games/top-rated`. Use the `/games/tag-puzzle` path instead.
|
|
429
|
-
|
|
430
|
-
7. **Download/purchase counts are not public.** No count field appears anywhere in the public HTML, JSON-LD, RSS, or unauthenticated API. Game owners see their stats in the dashboard only.
|
|
431
|
-
|
|
432
|
-
8. **Search beyond page 1 is AJAX-only.** `https://itch.io/search?q=X&page=2` via `http_get` returns the same 54 results as page 1. To get more search results use the browser and scroll/click "load more".
|
|
433
|
-
|
|
434
|
-
9. **RSS is capped at 36 items per page.** Paginate with `?page=N`. Very high page numbers (300+) return HTTP 404 on browse pages.
|
|
435
|
-
|
|
436
|
-
10. **Unicode zero-width space in some titles.** `\u200b` (zero-width space) appears at the start of certain titles (e.g. "Our Life: Beginnings & Always"). `.strip()` alone won't remove it; use `title.replace('\u200b', '').strip()`.
|
|
1
|
+
# itch.io — Scraping & Data Extraction
|
|
2
|
+
|
|
3
|
+
Field-tested against itch.io on 2026-04-18. All code blocks validated with live requests.
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## TL;DR — fastest approaches by task
|
|
8
|
+
|
|
9
|
+
| Task | Method | Notes |
|
|
10
|
+
|---|---|---|
|
|
11
|
+
| Browse listings (36/page) | `http_get` HTML | Works, no key, no bot block |
|
|
12
|
+
| Game detail (name, price, rating) | `http_get` + JSON-LD | `<script type="application/ld+json">` Product block |
|
|
13
|
+
| Info table (tags, genre, status) | `http_get` + regex on `game_info_panel_widget` | Always present |
|
|
14
|
+
| Top N games from any category | RSS `.xml` feed | Cleaner than HTML for bulk |
|
|
15
|
+
| API (key endpoints) | `http_get` + key in path | Free key per account at itch.io/user/settings/api-keys |
|
|
16
|
+
| Download/purchase counts | Not public | Owners only via dashboard |
|
|
17
|
+
|
|
18
|
+
`http_get` works on all itch.io game and browse pages with no extra headers needed.
|
|
19
|
+
No Cloudflare, no JS challenge, no CAPTCHA on standard game/browse routes.
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## Approach 1 (Fastest for listings): RSS feeds — 36 games per call, clean XML
|
|
24
|
+
|
|
25
|
+
Every browse URL has an `.xml` RSS variant. Returns price, pub/update dates, platforms, thumbnail. No HTML parsing.
|
|
26
|
+
|
|
27
|
+
```python
|
|
28
|
+
import re
|
|
29
|
+
from helpers import http_get
|
|
30
|
+
|
|
31
|
+
def parse_rss(url):
|
|
32
|
+
"""
|
|
33
|
+
Parse any itch.io RSS listing feed.
|
|
34
|
+
url examples:
|
|
35
|
+
https://itch.io/games/top-rated.xml
|
|
36
|
+
https://itch.io/games/newest.xml
|
|
37
|
+
https://itch.io/games/featured.xml
|
|
38
|
+
https://itch.io/games/on-sale.xml
|
|
39
|
+
https://itch.io/games/free.xml
|
|
40
|
+
https://itch.io/games/tag-puzzle.xml # any tag slug works
|
|
41
|
+
https://itch.io/games/top-rated.xml?page=2
|
|
42
|
+
"""
|
|
43
|
+
xml = http_get(url)
|
|
44
|
+
items = []
|
|
45
|
+
for m in re.finditer(r'<item>(.*?)</item>', xml, re.DOTALL):
|
|
46
|
+
ix = m.group(1)
|
|
47
|
+
def get(tag, s=ix):
|
|
48
|
+
tm = re.search(rf'<{tag}>(.*?)</{tag}>', s, re.DOTALL)
|
|
49
|
+
return tm.group(1).strip() if tm else None
|
|
50
|
+
items.append({
|
|
51
|
+
'url': get('guid'),
|
|
52
|
+
'title': get('plainTitle'), # clean title, no [tags]
|
|
53
|
+
'price': get('price'), # "$0.00", "$7.99", etc.
|
|
54
|
+
'currency': get('currency'), # "USD"
|
|
55
|
+
'pub_date': get('pubDate'),
|
|
56
|
+
'update_date': get('updateDate'),
|
|
57
|
+
'image': get('imageurl'), # 315x250 thumbnail
|
|
58
|
+
'platforms': {
|
|
59
|
+
k: get(k) == 'yes'
|
|
60
|
+
for k in ['windows', 'osx', 'linux', 'android', 'html']
|
|
61
|
+
if get(k) is not None
|
|
62
|
+
},
|
|
63
|
+
})
|
|
64
|
+
return items
|
|
65
|
+
|
|
66
|
+
# Confirmed output:
|
|
67
|
+
items = parse_rss("https://itch.io/games/top-rated.xml")
|
|
68
|
+
# items[0] -> {
|
|
69
|
+
# 'url': 'https://gbpatch.itch.io/our-life',
|
|
70
|
+
# 'title': 'Our Life: Beginnings & Always',
|
|
71
|
+
# 'price': '$0.00',
|
|
72
|
+
# 'currency': 'USD',
|
|
73
|
+
# 'pub_date': 'Fri, 07 Jun 2019 23:47:57 GMT',
|
|
74
|
+
# 'update_date': 'Sun, 22 May 2022 15:48:27 GMT',
|
|
75
|
+
# 'image': 'https://img.itch.zone/aW1nLzcwMTIxNDMucG5n/315x250%23c/BalGQb.png',
|
|
76
|
+
# 'platforms': {'windows': True, 'osx': True, 'linux': True, 'android': True},
|
|
77
|
+
# }
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
**RSS limitations:** no rating score or count. Use HTML scraping (Approach 2) when you need ratings.
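
The RSS fields all come back as strings. The sketch below shows one way to turn them into typed values, assuming prices look like `"$7.99"` and dates follow the `"Fri, 07 Jun 2019 23:47:57 GMT"` form seen in the sample output. `normalize_rss_item` is a hypothetical helper name, not part of `helpers`.

```python
from email.utils import parsedate_to_datetime

def normalize_rss_item(item):
    """Return a copy of a parse_rss() item with typed price and date fields.

    Assumption: prices look like "$7.99" and dates like
    "Fri, 07 Jun 2019 23:47:57 GMT", as in the sample output.
    """
    out = dict(item)
    price = item.get('price')
    # Drop the currency symbol and any thousands separators before converting
    out['price_amount'] = float(price.lstrip('$').replace(',', '')) if price else None
    out['is_free'] = (out['price_amount'] == 0.0) if price else None
    for key in ('pub_date', 'update_date'):
        raw = item.get(key)
        # Adds pub_date_dt / update_date_dt next to the original strings
        out[key + '_dt'] = parsedate_to_datetime(raw) if raw else None
    return out
```

The original string fields are left untouched, so callers that already rely on `parse_rss` output keep working.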
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Approach 2: HTML listings — ratings, genre, price, 36 games per page
|
|
85
|
+
|
|
86
|
+
```python
|
|
87
|
+
import re
|
|
88
|
+
from helpers import http_get
|
|
89
|
+
|
|
90
|
+
def parse_game_cards(html):
|
|
91
|
+
"""
|
|
92
|
+
Extract all game cards from any itch.io browse/listing/search/profile HTML page.
|
|
93
|
+
Works on:
|
|
94
|
+
https://itch.io/games/top-rated
|
|
95
|
+
https://itch.io/games/newest
|
|
96
|
+
https://itch.io/games/featured
|
|
97
|
+
https://itch.io/games/on-sale
|
|
98
|
+
https://itch.io/games/free
|
|
99
|
+
https://itch.io/games/tag-puzzle (genre/tag path)
|
|
100
|
+
https://itch.io/search?q=platformer (search — 54 cards per page)
|
|
101
|
+
https://<author>.itch.io (author profile)
|
|
102
|
+
Browse and tag pages accept ?page=N; search ignores it via http_get (see Gotchas).
|
|
103
|
+
"""
|
|
104
|
+
games = []
|
|
105
|
+
for m in re.finditer(r'data-game_id="(\d+)"', html):
|
|
106
|
+
game_id = m.group(1)
|
|
107
|
+
start = m.start()
|
|
108
|
+
chunk = html[start:start + 3000]
|
|
109
|
+
|
|
110
|
+
# Title + URL — attribute order differs between page 1 and pages 2+
|
|
111
|
+
title_m = re.search(
|
|
112
|
+
r'class="title game_link"[^>]*href="([^"]+)"[^>]*>([^<]+)</a>', chunk
|
|
113
|
+
)
|
|
114
|
+
if not title_m:
|
|
115
|
+
title_m = re.search(
|
|
116
|
+
r'href="([^"]+)"[^>]*class="title game_link"[^>]*>([^<]+)</a>', chunk
|
|
117
|
+
)
|
|
118
|
+
|
|
119
|
+
rating_m = re.search(
|
|
120
|
+
r'data-tooltip="([\d.]+) average rating from ([\d,]+) total ratings"', chunk
|
|
121
|
+
)
|
|
122
|
+
genre_m = re.search(r'class="game_genre">([^<]+)</div>', chunk)
|
|
123
|
+
price_m = re.search(r'class="price_value">([^<]+)</div>', chunk)
|
|
124
|
+
desc_m = re.search(r'class="game_text" title="([^"]+)"', chunk)
|
|
125
|
+
img_m = re.search(r'data-lazy_src="([^"]+)"', chunk)
|
|
126
|
+
platforms = re.findall(r'title="Download for ([^"]+)"', chunk)
|
|
127
|
+
|
|
128
|
+
games.append({
|
|
129
|
+
'id': game_id,
|
|
130
|
+
'url': title_m.group(1) if title_m else None,
|
|
131
|
+
'title': title_m.group(2).strip() if title_m else None,
|
|
132
|
+
'rating': float(rating_m.group(1)) if rating_m else None,
|
|
133
|
+
'rating_count': int(rating_m.group(2).replace(',', '')) if rating_m else None,
|
|
134
|
+
'genre': genre_m.group(1) if genre_m else None,
|
|
135
|
+
'price': price_m.group(1) if price_m else 'Free',
|
|
136
|
+
'description': desc_m.group(1) if desc_m else None,
|
|
137
|
+
'thumbnail': img_m.group(1) if img_m else None,
|
|
138
|
+
'platforms': platforms, # ['Windows', 'macOS', 'Linux', 'Android']
|
|
139
|
+
})
|
|
140
|
+
return games
|
|
141
|
+
|
|
142
|
+
# Usage:
|
|
143
|
+
html = http_get("https://itch.io/games/top-rated")
|
|
144
|
+
games = parse_game_cards(html)
|
|
145
|
+
# games[0] -> {
|
|
146
|
+
# 'id': '434554', 'url': 'https://gbpatch.itch.io/our-life',
|
|
147
|
+
# 'title': 'Our Life: Beginnings & Always',
|
|
148
|
+
# 'rating': 4.94, 'rating_count': 7191,
|
|
149
|
+
# 'genre': 'Visual Novel', 'price': 'Free',
|
|
150
|
+
# 'platforms': ['Windows', 'Linux', 'macOS', 'Android'],
|
|
151
|
+
# }
|
|
152
|
+
|
|
153
|
+
# Paid game example:
|
|
154
|
+
html = http_get("https://itch.io/games/top-rated?page=5")
|
|
155
|
+
games = parse_game_cards(html)
|
|
156
|
+
# Returns games where price_m captures '$7.99' when present
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
### CSS selector reference (for browser/JS use)
|
|
160
|
+
|
|
161
|
+
```
|
|
162
|
+
.game_cell — one card per game
|
|
163
|
+
.game_cell[data-game_id] — get game ID from attribute
|
|
164
|
+
.game_cell .title.game_link — title text + href
|
|
165
|
+
.game_cell .game_rating — rating container
|
|
166
|
+
.game_cell .game_rating[data-tooltip] — "4.94 average rating from 7,191 total ratings"
|
|
167
|
+
.game_cell .star_fill — inline style width: NN% (rating as percentage of 5)
|
|
168
|
+
.game_cell .rating_count — "(7,191)"
|
|
169
|
+
.game_cell .game_genre — genre text
|
|
170
|
+
.game_cell .price_tag .price_value — price e.g. "$7.99" (absent = Free)
|
|
171
|
+
.game_cell .game_text — one-line description (also in title attr)
|
|
172
|
+
.game_cell .game_author a — author name + href
|
|
173
|
+
.game_cell img.lazy_loaded — thumbnail (src in data-lazy_src before JS runs)
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
**Gotcha — attribute order flips on page >= 2.** Page 1 uses `class="..." data-game_id="..."`, page 2+ uses `data-game_id="..." class="..."`. The regex above handles both. If you use a CSS selector engine, `[data-game_id]` is unambiguous.
|
|
177
|
+
|
|
178
|
+
**Gotcha — ratings absent on some listing types.** The tag/genre browse pages (e.g. `/games/tag-puzzle`) sometimes omit the rating tooltip on the card even when the game has ratings. Fetch the detail page for the authoritative rating.
|
|
179
|
+
|
|
180
|
+
---
|
|
181
|
+
|
|
182
|
+
## Approach 3: Game detail page — JSON-LD Product schema
|
|
183
|
+
|
|
184
|
+
The cleanest source for individual game data. All confirmed fields:
|
|
185
|
+
|
|
186
|
+
```python
|
|
187
|
+
import json, re
|
|
188
|
+
from helpers import http_get
|
|
189
|
+
|
|
190
|
+
def extract_game_detail(url):
|
|
191
|
+
"""
|
|
192
|
+
Fetch full metadata for a single itch.io game.
|
|
193
|
+
url format: https://<author>.itch.io/<game-slug>
|
|
194
|
+
"""
|
|
195
|
+
html = http_get(url)
|
|
196
|
+
|
|
197
|
+
# --- JSON-LD (always present, covers name/description/price/rating) ---
|
|
198
|
+
ld_product = None
|
|
199
|
+
for block in re.findall(
|
|
200
|
+
r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
|
|
201
|
+
html, re.DOTALL
|
|
202
|
+
):
|
|
203
|
+
ld = json.loads(block.strip())
|
|
204
|
+
if ld.get('@type') == 'Product':
|
|
205
|
+
ld_product = ld
|
|
206
|
+
break
|
|
207
|
+
|
|
208
|
+
# --- Info panel table (Status, Platforms, Genre, Tags, Author, etc.) ---
|
|
209
|
+
info = {}
|
|
210
|
+
panel_m = re.search(
|
|
211
|
+
r'class="game_info_panel_widget[^"]*"[^>]*><table>(.*?)</table>',
|
|
212
|
+
html, re.DOTALL
|
|
213
|
+
)
|
|
214
|
+
if panel_m:
|
|
215
|
+
for row in re.finditer(
|
|
216
|
+
r'<tr><td>([^<]+)</td><td>(.*?)</td></tr>',
|
|
217
|
+
panel_m.group(1), re.DOTALL
|
|
218
|
+
):
|
|
219
|
+
key = row.group(1).strip()
|
|
220
|
+
val = re.sub(r'<[^>]+>', '', row.group(2)).strip()
|
|
221
|
+
# Multi-value fields become lists (Tags, Platforms, Genre, Links)
|
|
222
|
+
info[key] = [v.strip() for v in val.split(',')] if ',' in val else val
|
|
223
|
+
|
|
224
|
+
# --- Cover image ---
|
|
225
|
+
cover_m = re.search(r'<meta property="og:image" content="([^"]+)"', html)
|
|
226
|
+
|
|
227
|
+
offers = (ld_product or {}).get('offers', {})
|
|
228
|
+
agg = (ld_product or {}).get('aggregateRating', {})
|
|
229
|
+
|
|
230
|
+
return {
|
|
231
|
+
'url': url,
|
|
232
|
+
'name': (ld_product or {}).get('name'),
|
|
233
|
+
'description': (ld_product or {}).get('description'),
|
|
234
|
+
'price': offers.get('price'), # None for free (no 'offers' block), "7.99" for paid
|
|
235
|
+
'currency': offers.get('priceCurrency'), # "USD"
|
|
236
|
+
'rating': agg.get('ratingValue'), # "4.9" string
|
|
237
|
+
'rating_count': agg.get('ratingCount'), # int
|
|
238
|
+
'cover': cover_m.group(1) if cover_m else None,
|
|
239
|
+
'info': info,
|
|
240
|
+
}
|
|
241
|
+
|
|
242
|
+
# Free game:
|
|
243
|
+
r = extract_game_detail("https://gbpatch.itch.io/our-life")
|
|
244
|
+
# {
|
|
245
|
+
# 'name': 'Our Life: Beginnings & Always',
|
|
246
|
+
# 'description': 'Grow from childhood to adulthood with the lonely boy next door...',
|
|
247
|
+
# 'price': None, 'currency': None, <- no 'offers' block for free games
|
|
248
|
+
# 'rating': '4.9', 'rating_count': 7191,
|
|
249
|
+
# 'cover': 'https://img.itch.zone/aW1hZ2Uv.../347x500/7HqrvV.jpg',
|
|
250
|
+
# 'info': {
|
|
251
|
+
# 'Status': 'Released',
|
|
252
|
+
# 'Platforms': ['Windows', 'macOS', 'Linux', 'Android'],
|
|
253
|
+
# 'Rating': 'Rated 4.9 out of 5 stars(7,191 total ratings)',
|
|
254
|
+
# 'Author': 'GBPatch',
|
|
255
|
+
# 'Genre': ['Visual Novel', 'Interactive Fiction'],
|
|
256
|
+
# 'Tags': ['Amare', 'Comedy', 'Dating Sim', 'Gay', 'LGBT', ...],
|
|
257
|
+
# 'Links': 'Steam',
|
|
258
|
+
# }
|
|
259
|
+
# }
|
|
260
|
+
|
|
261
|
+
# Paid game:
|
|
262
|
+
r = extract_game_detail("https://adamgryu.itch.io/a-short-hike")
|
|
263
|
+
# {
|
|
264
|
+
# 'name': 'A Short Hike',
|
|
265
|
+
# 'price': '7.99', 'currency': 'USD',
|
|
266
|
+
# 'rating': '4.9', 'rating_count': 4307,
|
|
267
|
+
# 'info': {
|
|
268
|
+
# 'Status': 'Released',
|
|
269
|
+
# 'Platforms': ['Windows', 'macOS', 'Linux'],
|
|
270
|
+
# 'Release date': 'Jul 30, 2019',
|
|
271
|
+
# 'Genre': ['Adventure', 'Platformer'],
|
|
272
|
+
# 'Made with': 'Unity',
|
|
273
|
+
# 'Tags': ['3D', 'Atmospheric', 'Cute', 'Relaxing', 'Short', ...],
|
|
274
|
+
# 'Average session': 'About an hour',
|
|
275
|
+
# 'Languages': ['English', 'Spanish; Latin America', 'French', ...],
|
|
276
|
+
# 'Inputs': ['Keyboard', 'Mouse', 'Xbox controller', ...],
|
|
277
|
+
# 'Accessibility': ['Subtitles', 'Configurable controls'],
|
|
278
|
+
# 'Links': ['Steam', 'Homepage', 'Soundtrack', 'Twitter/X'],
|
|
279
|
+
# }
|
|
280
|
+
# }
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
**JSON-LD available fields:**
|
|
284
|
+
|
|
285
|
+
| Field | Free game | Paid game |
|
|
286
|
+
|---|---|---|
|
|
287
|
+
| `@type` | `Product` | `Product` |
|
|
288
|
+
| `name` | yes | yes |
|
|
289
|
+
| `description` | yes | yes |
|
|
290
|
+
| `aggregateRating.ratingValue` | yes | yes |
|
|
291
|
+
| `aggregateRating.ratingCount` | yes | yes |
|
|
292
|
+
| `offers.price` | absent | yes ("7.99") |
|
|
293
|
+
| `offers.priceCurrency` | absent | yes ("USD") |
|
|
294
|
+
| `offers.seller.name` | absent | yes (author name) |
|
|
295
|
+
| `offers.seller.url` | absent | yes (author profile URL) |
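
Since the table shows `offers` missing entirely for free games, a caller probably wants one normalized price value. The following is a minimal sketch under that assumption; `price_from_json_ld` is a hypothetical name, and treating a missing `offers` block as price `0.0` is a choice made here, not something itch.io documents.

```python
def price_from_json_ld(ld_product):
    """Map a JSON-LD Product dict to (price, currency).

    A Product without an 'offers' block is treated as free, per the table.
    Returns (None, None) when no Product block was found at all.
    """
    if ld_product is None:
        return None, None            # JSON-LD missing: price unknown
    offers = ld_product.get('offers')
    if not offers:
        return 0.0, None             # free game: no offers block
    price = offers.get('price')
    return (float(price) if price is not None else None), offers.get('priceCurrency')
```

Keeping "unknown" (`None`) separate from "free" (`0.0`) avoids mislabeling a page whose JSON-LD failed to parse.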
|
|
296
|
+
|
|
297
|
+
---
|
|
298
|
+
|
|
299
|
+
## Pagination
|
|
300
|
+
|
|
301
|
+
Browse pages: `?page=N`. Detect end of results by HTTP 404 (page too high) or absent `<link rel="next">`.
|
|
302
|
+
|
|
303
|
+
```python
|
|
304
|
+
import re
|
|
305
|
+
from helpers import http_get
|
|
306
|
+
|
|
307
|
+
def paginate_listing(base_url, max_pages=10):
|
|
308
|
+
"""
|
|
309
|
+
Scrape multiple pages from any itch.io browse URL.
|
|
310
|
+
base_url: https://itch.io/games/top-rated (no ?page= suffix)
|
|
311
|
+
Returns flat list of game dicts.
|
|
312
|
+
Stops when HTTP 404 or no <link rel="next"> found.
|
|
313
|
+
"""
|
|
314
|
+
all_games = []
|
|
315
|
+
page = 1
|
|
316
|
+
while page <= max_pages:
|
|
317
|
+
url = base_url if page == 1 else f"{base_url}?page={page}"
|
|
318
|
+
try:
|
|
319
|
+
html = http_get(url)
|
|
320
|
+
except Exception:
|
|
321
|
+
break # 404 = past last page
|
|
322
|
+
all_games.extend(parse_game_cards(html))
|
|
323
|
+
if not re.search(r'<link[^>]+rel="next"[^>]*/>', html):
|
|
324
|
+
break
|
|
325
|
+
page += 1
|
|
326
|
+
return all_games
|
|
327
|
+
|
|
328
|
+
# Confirmed: page 1 has <link href="?page=2" rel="next"/>
|
|
329
|
+
# page 2 has <link rel="prev" href="/games/top-rated"/> and <link rel="next" href="?page=3"/>
|
|
330
|
+
# past last page returns HTTP 404
|
|
331
|
+
# top-rated has at least 200 pages (each 36 games); page 300+ -> 404
|
|
332
|
+
```
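
The confirmed `<link>` tags put `href` and `rel` in either order. As an alternative to the regex in `paginate_listing`, the sketch below reads the next page number from the tag regardless of attribute order. `next_page_number` is a hypothetical helper, and it assumes the `href` carries a `page=N` parameter as in the confirmed examples.

```python
import re

def next_page_number(html):
    """Return the page number advertised by <link rel="next">, or None."""
    for tag in re.findall(r'<link\b[^>]*>', html):
        if 'rel="next"' in tag:
            # href may come before or after rel, so search within the whole tag
            m = re.search(r'href="[^"]*\bpage=(\d+)', tag)
            return int(m.group(1)) if m else None
    return None
```

A loop can then stop when this returns `None` instead of testing for the tag and incrementing a counter separately.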
|
|
333
|
+
|
|
334
|
+
---
|
|
335
|
+
|
|
336
|
+
## Browse URL patterns
|
|
337
|
+
|
|
338
|
+
All confirmed working via `http_get`:
|
|
339
|
+
|
|
340
|
+
```python
|
|
341
|
+
BASE = "https://itch.io/games"
|
|
342
|
+
|
|
343
|
+
# Sort orders
|
|
344
|
+
f"{BASE}/top-rated" # all-time top rated (rated by community, 0–5 stars)
|
|
345
|
+
f"{BASE}/newest" # most recently published
|
|
346
|
+
f"{BASE}/featured" # itch.io staff picks
|
|
347
|
+
f"{BASE}/on-sale" # discounted games
|
|
348
|
+
f"{BASE}/free" # free games only
|
|
349
|
+
|
|
350
|
+
# Genre/tag paths (append .xml for RSS)
|
|
351
|
+
f"{BASE}/tag-puzzle" # tag slug — prefix with 'tag-'
|
|
352
|
+
f"{BASE}/genre-action" # genre — prefix with 'genre-' (less common)
|
|
353
|
+
|
|
354
|
+
# Combine: tag + sort via separate pages (no combined URL that survives http_get)
|
|
355
|
+
# Note: https://itch.io/games/top-rated/tag-puzzle -> HTTP 403
|
|
356
|
+
# Note: ?tag= query param does NOT filter server-side (returns same games)
|
|
357
|
+
|
|
358
|
+
# Pagination
|
|
359
|
+
f"{BASE}/top-rated?page=2"
|
|
360
|
+
f"{BASE}/tag-puzzle?page=3"
|
|
361
|
+
|
|
362
|
+
# RSS equivalents (36 items, no pagination needed for small sets)
|
|
363
|
+
f"{BASE}/top-rated.xml"
|
|
364
|
+
f"{BASE}/tag-puzzle.xml"
|
|
365
|
+
f"{BASE}/tag-puzzle.xml?page=2"
|
|
366
|
+
|
|
367
|
+
# Search (54 results/page, no server-side pagination beyond page 1 via http_get)
|
|
368
|
+
"https://itch.io/search?q=platformer"
|
|
369
|
+
|
|
370
|
+
# Author profile
|
|
371
|
+
"https://<author-slug>.itch.io"
|
|
372
|
+
```
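
The patterns above can be folded into one small builder. This is a sketch only: `browse_url` is a hypothetical name, and it encodes just the rules listed in this section (one listing slug per path, `.xml` for RSS, `?page=N` for pagination).

```python
def browse_url(listing, page=1, rss=False):
    """Build an itch.io browse URL.

    listing: a sort order ('top-rated', 'newest', ...) or a prefixed slug
             ('tag-puzzle', 'genre-action'). Sort and tag cannot share a
             path: /games/top-rated/tag-puzzle returns HTTP 403.
    """
    if '/' in listing:
        raise ValueError("sort and tag cannot be combined in one path")
    url = f"https://itch.io/games/{listing}"
    if rss:
        url += ".xml"              # RSS variant of the same listing
    if page > 1:
        url += f"?page={page}"     # page 1 is the bare URL
    return url
```

For example, `browse_url('tag-puzzle', page=2, rss=True)` produces the same URL as the `tag-puzzle.xml?page=2` entry in the list.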
|
|
373
|
+
|
|
374
|
+
---
|
|
375
|
+
|
|
376
|
+
## API (requires key)
|
|
377
|
+
|
|
378
|
+
itch.io has an official REST API. A free key is issued per-account with no rate limit published.
|
|
379
|
+
Get one at: `https://itch.io/user/settings/api-keys`
|
|
380
|
+
|
|
381
|
+
Base URL: `https://itch.io/api/1/<key>/`
|
|
382
|
+
|
|
383
|
+
```python
|
|
384
|
+
import json
|
|
385
|
+
from helpers import http_get
|
|
386
|
+
|
|
387
|
+
ITCH_KEY = "your_api_key_here" # from https://itch.io/user/settings/api-keys
|
|
388
|
+
|
|
389
|
+
def api(path):
|
|
390
|
+
return json.loads(http_get(f"https://itch.io/api/1/{ITCH_KEY}/{path}"))
|
|
391
|
+
|
|
392
|
+
# Authenticated user info
|
|
393
|
+
api("me")
|
|
394
|
+
# -> {"user": {"id": ..., "username": "...", "url": "...", "display_name": "...", ...}}
|
|
395
|
+
|
|
396
|
+
# Games owned by authenticated user
|
|
397
|
+
api("my-games")
|
|
398
|
+
# -> {"games": [{"id": ..., "title": "...", "url": "...", "created_at": "...",
|
|
399
|
+
# "published": true/false, "min_price": 0, ...}, ...]}
|
|
400
|
+
|
|
401
|
+
# Download keys for a game (owner only)
|
|
402
|
+
api("game/434554/download_keys")
|
|
403
|
+
|
|
404
|
+
# Credentials (for authenticated purchases)
|
|
405
|
+
api("game/434554/credentials")
|
|
406
|
+
```
|
|
407
|
+
|
|
408
|
+
**Error structure:** invalid/missing key returns `{"errors": ["invalid key"]}` with HTTP 200.
|
|
409
|
+
Non-existent endpoints return HTTP 404.
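
Because an invalid key comes back with HTTP 200, the status code alone cannot signal failure, and the decoded payload has to be inspected. A minimal sketch of that check follows; `ItchApiError` and `check_api_response` are hypothetical names introduced for illustration.

```python
class ItchApiError(Exception):
    """Raised when a decoded API payload carries an 'errors' list."""

def check_api_response(payload):
    """Return payload unchanged, or raise if it contains API errors."""
    errors = payload.get('errors') if isinstance(payload, dict) else None
    if errors:
        raise ItchApiError('; '.join(str(e) for e in errors))
    return payload
```

Wrapping the result of `api(...)` in `check_api_response` would surface a bad key as an exception instead of a dict that merely lacks the expected fields.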
|
|
410
|
+
|
|
411
|
+
**No unauthenticated game lookup API.** `https://itch.io/api/1/x/games` -> HTTP 404.
|
|
412
|
+
Use HTML scraping or RSS for unauthenticated game data.
|
|
413
|
+
|
|
414
|
+
---
|
|
415
|
+
|
|
416
|
+
## Gotchas
|
|
417
|
+
|
|
418
|
+
1. **Attribute order flips page 1 vs 2+.** On page 1, game cards use `class="game_cell ..." data-game_id="..."`. On pages 2+, the order is `data-game_id="..." class="game_cell ..."`. Always match `data-game_id` independently of class ordering.
|
|
419
|
+
|
|
420
|
+
2. **Ratings absent on tag/genre listing pages.** The `data-tooltip` with rating is often missing from card HTML on `/games/tag-*` pages even though the game has ratings. Fetch the detail page for `aggregateRating` via JSON-LD.
|
|
421
|
+
|
|
422
|
+
3. **`price_value` absent = Free.** Paid games have `<div class="price_tag meta_tag" title="Pay $7.99 or more..."><div class="price_value">$7.99</div></div>`. Free games have no such element. Default to `'Free'` when absent.
|
|
423
|
+
|
|
424
|
+
4. **Free-game JSON-LD has no `offers` block.** Only paid games include the `offers` object. For free games, use absence of `offers` as the signal, not presence of `price: 0`.
|
|
425
|
+
|
|
426
|
+
5. **`/games/top-rated/tag-puzzle` returns HTTP 403.** Cannot combine sort + tag in a path. Use separate `/games/tag-puzzle` (top-rated is the default sort anyway).
|
|
427
|
+
|
|
428
|
+
6. **`?tag=` query param is ignored server-side.** `https://itch.io/games/top-rated?tag=puzzle` returns the same games as `/games/top-rated`. Use the `/games/tag-puzzle` path instead.
|
|
429
|
+
|
|
430
|
+
7. **Download/purchase counts are not public.** No count field appears anywhere in the public HTML, JSON-LD, RSS, or unauthenticated API. Game owners see their stats in the dashboard only.
|
|
431
|
+
|
|
432
|
+
8. **Search beyond page 1 is AJAX-only.** `https://itch.io/search?q=X&page=2` via `http_get` returns the same 54 results as page 1. To get more search results use the browser and scroll/click "load more".
|
|
433
|
+
|
|
434
|
+
9. **RSS is capped at 36 items per page.** Paginate with `?page=N`. Very high page numbers (300+) return HTTP 404 on browse pages.
|
|
435
|
+
|
|
436
|
+
10. **Unicode zero-width space in some titles.** `\u200b` (zero-width space) appears at the start of certain titles (e.g. "Our Life: Beginnings & Always"). `.strip()` alone won't remove it; use `title.replace('\u200b', '').strip()`.
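
The title cleanup from gotcha 10 can be wrapped once and applied to every parsed title. A small sketch, with `clean_title` as a hypothetical helper name:

```python
def clean_title(title):
    """Remove zero-width spaces (U+200B), then trim ordinary whitespace."""
    if title is None:
        return None
    # str.strip() does not treat U+200B as whitespace, so remove it explicitly
    return title.replace('\u200b', '').strip()
```

It passes `None` through, so it can be applied directly to the `'title'` values returned by `parse_rss` or `parse_game_cards`.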
|