@pencil-agent/nano-pencil 2.0.0-beta.9 → 2.0.0

Files changed (207)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/types.d.ts +5 -8
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  129. package/dist/extensions/builtin/goal/goal-prompts.js +4 -4
  130. package/dist/extensions/builtin/grub/README.md +112 -112
  131. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  132. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  133. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  134. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  135. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  136. package/dist/extensions/builtin/loop/README.md +92 -92
  137. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  138. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  139. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  140. package/dist/extensions/builtin/sal/README.md +72 -72
  141. package/dist/extensions/builtin/security-audit/README.md +289 -289
  142. package/dist/extensions/builtin/team/AGENT.md +112 -112
  143. package/dist/extensions/builtin/team/TESTING.md +299 -299
  144. package/dist/extensions/builtin/token-save/README.md +56 -56
  145. package/dist/extensions/optional/AGENT.md +10 -10
  146. package/dist/index.d.ts +5 -30
  147. package/dist/index.js +1 -1
  148. package/dist/models.d.ts +7 -0
  149. package/dist/models.js +1 -0
  150. package/dist/modes/interactive/theme/dark.json +85 -85
  151. package/dist/modes/interactive/theme/light.json +84 -84
  152. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  153. package/dist/modes/interactive/theme/warm.json +81 -81
  154. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  155. package/dist/packages/protocol/src/flags.d.ts +20 -0
  156. package/dist/packages/protocol/src/flags.js +0 -0
  157. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  158. package/dist/packages/protocol/src/hooks.js +0 -0
  159. package/dist/packages/protocol/src/index.d.ts +4 -2
  160. package/dist/packages/protocol/src/index.js +1 -1
  161. package/dist/packages/protocol/src/lifecycle.d.ts +11 -21
  162. package/dist/public-config.d.ts +12 -0
  163. package/dist/public-config.js +1 -0
  164. package/dist/runtime.d.ts +9 -0
  165. package/dist/runtime.js +1 -0
  166. package/dist/session-compaction.d.ts +7 -0
  167. package/dist/session-compaction.js +1 -0
  168. package/dist/session.d.ts +7 -0
  169. package/dist/session.js +1 -0
  170. package/dist/skills.d.ts +7 -0
  171. package/dist/skills.js +1 -0
  172. package/dist/tools.d.ts +7 -0
  173. package/dist/tools.js +1 -0
  174. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  175. package/docs/SDK-TESTING.md +364 -0
  176. package/docs/codex-goal-command-impl.md +1055 -1055
  177. package/docs/codex-goal-vs-grub.md +500 -500
  178. package/docs/custom-provider.md +27 -27
  179. package/docs/extensions.md +27 -27
  180. package/docs/keybindings.md +27 -27
  181. package/docs/loop 重构完成总结.md +250 -250
  182. package/docs/loop 重构完成报告.md +122 -122
  183. package/docs/loop 重构方案.md +1222 -1222
  184. package/docs/loop 重构方案实现报告.md +158 -158
  185. package/docs/loop 重构方案对比分析.md +128 -128
  186. package/docs/loop 重构计划.md +320 -320
  187. package/docs/loop-usage-examples.md +214 -214
  188. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  189. package/docs/models.md +27 -27
  190. package/docs/packages.md +27 -27
  191. package/docs/pi-design-philosophy.md +457 -457
  192. package/docs/planmode.md +1987 -1987
  193. package/docs/prompt-templates.md +27 -27
  194. package/docs/providers.md +27 -27
  195. package/docs/sdk.md +27 -27
  196. package/docs/skills.md +27 -27
  197. package/docs/startup-performance-optimization.md +301 -0
  198. package/docs/themes.md +27 -27
  199. package/docs/tui.md +27 -27
  200. package/docs/认知地图.md +47 -0
  201. package/package.json +190 -162
  202. package/docs/cc-agent-design.md +0 -1297
  203. package/docs/cc-tui-design.md +0 -1333
  204. package/docs/nanoPencil-学习计划.md +0 -170
  205. package/docs/scan-report.md +0 -3820
  206. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  207. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,578 +1,578 @@
1
- # Booking.com — Scraping & Data Extraction
2
-
3
- Field-tested against booking.com on 2026-04-18 using `http_get` and the
4
- `dml/graphql` JSON API. All tests run without a browser session.
5
-
6
- ---
7
-
8
- ## TL;DR
9
-
10
- **`http_get` returns nothing useful from booking.com.** Every HTML page —
11
- search results, hotel pages, city pages, the homepage — is intercepted by an
12
- AWS WAF JS challenge before any content is served. The challenge requires
13
- JavaScript execution to complete a cryptographic puzzle and set an
14
- `aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.
15
-
16
- **What you can do without a browser:**
17
- - Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
18
- - Read `robots.txt` for URL pattern documentation.
19
- - Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
20
- exploration (no auth = internal errors, but validation errors reveal the
21
- schema).
22
-
23
- **For all actual data extraction, use the browser (`goto` + `js`).**
24
-
25
- ---
26
-
27
- ## AWS WAF JS Challenge — What It Is
28
-
29
- Every `http_get` request to `www.booking.com` receives one of two variants of
30
- a WAF stub:
31
-
32
- **Variant A (~3,962 bytes) — modern SDK:**
33
- ```html
34
- <script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
35
- <script>
36
- AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
37
- </script>
38
- ```
39
-
40
- **Variant B (~8,410 bytes) — with AJAX error reporting:**
41
- Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
42
- `https://reports.booking.com/chal_report`. This variant is more common on
43
- non-browser UA strings.
44
-
45
- **Detection in your code:**
46
- ```python
47
- def is_waf_blocked(html: str) -> bool:
48
-     return (
49
-         'AwsWafIntegration' in html
50
-         or 'awsWafCookieDomainList' in html
51
-         or 'challenge.js' in html
52
-         or len(html) < 10_000 and '<title></title>' in html
53
-     )
54
- ```
55
-
56
- **What the challenge does:**
57
- 1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
58
- 2. Executes a cryptographic proof-of-work puzzle client-side.
59
- 3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
60
- 4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
61
- appended.
62
-
63
- This challenge **cannot be solved by `http_get`**. It requires a real JS
64
- engine. A `bkng` session cookie is set on the first blocked response, but it
65
- has no value without the WAF token.
66
-
67
- **User agents tested — all blocked:**
68
- - Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
69
- - iPhone/Safari mobile
70
- - `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
71
- - Default `urllib` UA
72
-
73
- ---
74
-
75
- ## What `http_get` CAN Access
76
-
77
- ### 1. XML Sitemaps (URL discovery)
78
-
79
- Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
80
- millions of property, city, region, and attraction URLs without a browser.
81
-
82
- ```python
83
- import gzip, re, urllib.request
84
-
85
- GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
86
-
87
- def fetch_sitemap_index(url: str) -> list[str]:
88
-     """Returns list of child sitemap URLs from an index sitemap."""
89
-     xml = http_get(url, headers=GOOGLEBOT)
90
-     return re.findall(r'<loc>(https://[^<]+)</loc>', xml)
91
-
92
- def fetch_sitemap_gz(gz_url: str) -> list[str]:
93
-     """Decompresses a gzipped sitemap and returns all <loc> URLs."""
94
-     req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
95
-     with urllib.request.urlopen(req, timeout=30) as r:
96
-         data = gzip.decompress(r.read())
97
-     return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())
98
-
99
- # Example: get all en-gb hotel URLs
100
- hotel_idx = http_get(
101
-     "https://www.booking.com/sitembk-hotel-index.xml",
102
-     headers=GOOGLEBOT
103
- )
104
- # 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
105
- en_gb_shards = re.findall(
106
-     r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
107
-     hotel_idx
108
- )
109
- # hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard
110
- ```
111
-
112
- **Available sitemap categories (confirmed, 275 total):**
113
-
114
- | Index URL | Content |
115
- |-----------|---------|
116
- | `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
117
- | `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
118
- | `sitembk-region-index.xml` | Region landing pages |
119
- | `sitembk-country-index.xml` | Country landing pages |
120
- | `sitembk-attractions-index.xml` | Attractions |
121
- | `sitembk-hotel-review-index.xml` | Review pages |
122
- | `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
123
-
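A minimal sketch of how shard URLs pulled from the hotel index could be grouped by locale before fetching. It assumes every hotel shard follows the `sitembk-hotel-{locale}.{n}.xml.gz` pattern seen for `en-gb` above; the function name `group_shards_by_locale` and the sample URLs in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: all locales use the same shard naming as the en-gb shards.
_SHARD = re.compile(
    r"sitembk-hotel-(?P<locale>[a-z]{2}(?:-[a-z]{2})?)\.(?P<n>\d+)\.xml\.gz$"
)

def group_shards_by_locale(shard_urls):
    """Map locale -> list of hotel shard URLs; non-matching URLs are skipped."""
    groups = defaultdict(list)
    for url in shard_urls:
        match = _SHARD.search(url)
        if match:
            groups[match.group("locale")].append(url)
    return dict(groups)
```

Passing the `<loc>` values from `sitembk-hotel-index.xml` would yield one list per locale, so a crawl can be limited to a single language (for example `groups["en-gb"]`) instead of downloading every shard.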
124
- ### 2. `robots.txt`
125
-
126
- ```python
127
- robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
128
- ```
129
-
130
- - Returns immediately, no WAF.
131
- - 136 Disallow entries, 275 Sitemap declarations.
132
- - Documents all URL structures (search results, hotel pages, booking flow, etc.).
133
-
134
- ### 3. GraphQL Schema Exploration (no auth)
135
-
136
- The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
137
- It accepts POST requests and returns JSON. Without a session, most queries
138
- return `Internal Server Error` from the backend (`irene` service), but
139
- **GraphQL validation errors fire before the backend** and reveal the schema.
140
-
141
- ```python
142
- import json, urllib.request, gzip
143
-
144
- GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
145
- GQL_HEADERS = {
146
-     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
147
-     "Accept": "application/json",
148
-     "Content-Type": "application/json",
149
-     "Origin": "https://www.booking.com",
150
-     "Referer": "https://www.booking.com/searchresults.html",
151
-     "x-booking-context-action-name": "searchresults",
152
-     "x-booking-context-aid": "376510",
153
-     "x-booking-site-type-id": "1",
154
- }
155
-
156
- def gql(operation_name: str, query: str, variables: dict = None) -> dict:
157
-     payload = {"operationName": operation_name, "query": query}
158
-     if variables:
159
-         payload["variables"] = variables
160
-     req = urllib.request.Request(
161
-         GQL_URL,
162
-         data=json.dumps(payload).encode(),
163
-         headers=GQL_HEADERS,
164
-         method="POST"
165
-     )
166
-     with urllib.request.urlopen(req, timeout=20) as r:
167
-         data = r.read()
168
-         if r.headers.get("Content-Encoding") == "gzip":
169
-             data = gzip.decompress(data)
170
-     return json.loads(data.decode())
171
- ```
172
-
173
- **Confirmed Query type fields (schema, field-tested 2026-04-18):**
174
-
175
- | Field | Input type | Notes |
176
- |-------|-----------|-------|
177
- | `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
178
- | `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
179
- | `searchProperties` | `SearchInput!` | Returns 500 without auth session |
180
- | `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
181
- | `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |
182
-
183
- **Important:** Booking.com GraphQL uses an **operation name whitelist** for
184
- some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
185
- following confirmed working names: `SearchResultsPage`, `SearchQuery`,
186
- `HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.
187
-
188
- **Operation names that pass the whitelist check** (each returns
189
- `{ data: { __typename: 'Query' } }` when sent the query `{ __typename }`):
190
- - `SearchResultsPage` ✓ (confirmed, use this)
191
-
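The whitelist probing described above could be sketched as a loop over candidate operation names. This is an illustration only: `first_accepted_operation` is a hypothetical helper, `send` is any callable with the same `(operation_name, query)` shape as the `gql` function earlier in this file, and the exact error payload is not documented here, so the check simply searches the serialized response for the error code.

```python
import json

# Candidate names taken from the list above.
CANDIDATE_NAMES = [
    "SearchResultsPage", "SearchQuery", "HotelCardsList",
    "SearchResultsList", "PropertySearch", "BookingSearch",
]

def first_accepted_operation(send, names=CANDIDATE_NAMES):
    """Return the first operation name the endpoint does not reject, else None."""
    for name in names:
        query = f"query {name} {{ __typename }}"
        response = send(name, query)
        # Assumption: a rejected name surfaces this code somewhere in the JSON.
        if "GRAPHQL_UNKNOWN_OPERATION_NAME" not in json.dumps(response):
            return name
    return None
```

Taking `send` as a parameter keeps the probe testable offline; in a live session one would presumably pass `gql` and handle any HTTP errors it raises.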
192
- **The search query structure** (known but returns 500 without session):
193
- ```graphql
194
- query SearchResultsPage($input: SearchQueryInput!) {
195
- searchQueries {
196
- search(input: $input) {
197
- __typename # Returns SearchQueryResult type
198
- }
199
- }
200
- }
201
- ```
202
-
203
- With `SearchQueryInput` fields (inferred from URL parameters, confirmed
204
- accepted by validation):
205
- ```json
206
- {
207
- "dest_id": "-1456928",
208
- "dest_type": "CITY",
209
- "checkin": "2026-05-01",
210
- "checkout": "2026-05-03",
211
- "group_adults": "2",
212
- "no_rooms": "1",
213
- "group_children": "0",
214
- "selected_currency": "USD"
215
- }
216
- ```
217
-
218
- ---
219
-
220
- ## URL Parameter Reference
221
-
222
- ### Search Results
223
- `https://www.booking.com/searchresults.html`
224
-
225
- | Parameter | Type | Example | Notes |
226
- |-----------|------|---------|-------|
227
- | `ss` | string | `Paris` | Free-text: city, hotel name, address |
228
- | `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
229
- | `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
230
- | `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
231
- | `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
232
- | `group_adults` | int | `2` | |
233
- | `no_rooms` | int | `1` | |
234
- | `group_children` | int | `0` | |
235
- | `age` | int (repeatable) | `5` | Child age; one per child |
236
- | `selected_currency` | string | `USD` | ISO 4217 currency code |
237
- | `lang` | string | `en-us` | BCP 47 locale |
238
- | `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
239
- | `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
240
- | `offset` | int | `25` | Pagination (0-based, step 25) |
241
- | `rows` | int | `25` | Results per page (max 25) |
242
- | `map` | `1` | `1` | Map view mode |
243
- | `src` | string | `searchresults` | Source context (cosmetic) |
244
-
245
- **Common `nflt` filter codes:**
246
- - `ht_id=204` — Hotels only
247
- - `class=3;class=4;class=5` — Star rating
248
- - `review_score=90` — Guest rating ≥ 9.0
249
- - `fc=2` — Free cancellation
250
- - `rm_types=…` — Room type
251
- - `pri=1;pri=2` — Price tier (budget / mid / upscale)
252
-
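As a rough illustration of the parameter table, the sketch below assembles a search URL with the standard library. The helper name `build_search_url` and its defaults are hypothetical, and it assumes Booking.com accepts a percent-encoded `nflt` value (`;` and `=` are escaped by `urlencode`), which was not verified here.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.booking.com/searchresults.html"

def build_search_url(ss, checkin, checkout, adults=2, rooms=1,
                     children_ages=(), currency="USD",
                     filters=(), order=None, offset=0):
    """Build a searchresults.html URL from the documented parameters."""
    params = [
        ("ss", ss),
        ("checkin", checkin),        # YYYY-MM-DD
        ("checkout", checkout),      # YYYY-MM-DD
        ("group_adults", adults),
        ("no_rooms", rooms),
        ("group_children", len(children_ages)),
    ]
    # `age` is repeatable: one entry per child.
    params += [("age", age) for age in children_ages]
    params.append(("selected_currency", currency))
    if filters:
        # nflt is a single semicolon-separated string, e.g. "ht_id=204;class=4".
        params.append(("nflt", ";".join(filters)))
    if order:
        params.append(("order", order))
    if offset:
        params.append(("offset", offset))
    return f"{SEARCH_BASE}?{urlencode(params)}"
```

For example, `build_search_url("Paris", "2026-05-01", "2026-05-03", filters=["ht_id=204", "class=4"])` produces a URL suitable for `new_tab()` in the browser flow.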
253
- ### Property Pages
254
- `https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`
255
-
256
- Confirmed from sitemap (74 shards, ~3.5M properties):
257
- ```
258
- https://www.booking.com/hotel/{cc}/{slug}.html
259
- https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
260
- https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
261
- ```
262
- - `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
263
- - `slug` = hotel name, lowercase, hyphen-separated
264
- - Locale suffix optional; omit for default (English)
265
-
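The property URL shapes above can be taken apart with a small regular expression. This is a sketch under two assumptions: slugs never contain a dot, and locale suffixes look like `en-gb` or `de`. The name `parse_property_url` is hypothetical.

```python
import re
from urllib.parse import urlsplit

# /hotel/{cc}/{slug}.html  or  /hotel/{cc}/{slug}.{lang}.html
_PROPERTY_PATH = re.compile(
    r"^/hotel/(?P<cc>[a-z]{2})/(?P<slug>[^/.]+)"
    r"(?:\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?))?\.html$"
)

def parse_property_url(url):
    """Return {'cc', 'slug', 'lang'} for a property URL, or None if it does not match."""
    match = _PROPERTY_PATH.match(urlsplit(url).path)
    return match.groupdict() if match else None
```

`lang` is `None` when the locale suffix is omitted, which makes it straightforward to deduplicate sitemap entries that differ only by locale.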
266
- ### City / Region / Country Pages
267
- ```
268
- https://www.booking.com/city/{cc}/{city-slug}.html
269
- https://www.booking.com/region/{cc}/{region-slug}.html
270
- https://www.booking.com/country/{cc}.html
271
- ```
272
-
273
- ---
274
-
275
- ## Browser-Based Extraction (Required for All Data)
276
-
277
- Since `http_get` is blocked, all actual data extraction requires the browser
278
- (`goto` + `js`). The WAF challenge resolves automatically in Chrome.
279
-
280
- ### Initial Navigation
281
-
282
- ```python
283
- # Always use new_tab() for the first Booking.com load in a session
284
- tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
285
- wait_for_load()
286
- wait(3) # React hydration takes ~3s after readyState=complete
287
-
288
- # Check for WAF challenge still running (rare in real Chrome)
289
- url = page_info()["url"]
290
- if "chal_t=" in url:
291
-     wait(5) # WAF challenge resolving
292
-     wait_for_load()
293
- ```
294
-
295
- ### GDPR / Cookie Consent Banner (EU Visitors)
296
-
297
- Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
298
- the WAF challenge resolves. It blocks interaction until dismissed.
299
-
300
- ```python
301
- def dismiss_cookie_banner():
302
-     # Booking.com uses data-testid="accept" on the Accept button
303
-     accepted = js("""
304
-         (function() {
305
-             var btn = document.querySelector('[data-testid="accept"]')
306
-                 || document.querySelector('#onetrust-accept-btn-handler')
307
-                 || document.querySelector('[aria-label*="Accept"]');
308
-             if (btn) { btn.click(); return true; }
309
-             return false;
310
-         })()
311
-     """)
312
-     return accepted
313
-
314
- # Call immediately after load if you have an EU IP
315
- if dismiss_cookie_banner():
316
-     wait(1)
317
- ```
318
-
319
- The consent banner does **not** appear in the WAF stub — it only renders after
320
- the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
321
- may not see it at all.
322
-
323
- ### Search Results Page Extraction
324
-
325
- ```python
326
- results = js("""
327
- Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
328
- name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
329
- url: el.querySelector('[data-testid="title-link"]')?.href,
330
- price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
331
- rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
332
- stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
333
- location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
334
- availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
335
- is_genius: !!el.querySelector('[data-testid="genius-label"]'),
336
- }))
337
- """)
338
- ```
339
-
340
- **Field notes:**
341
- - `data-testid="property-card"` — confirmed selector for result cards (as of
342
- 2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
343
- - `data-testid="price-and-discounted-price"` — contains the nightly rate;
344
- may show original + discounted price together as text.
345
- - `data-testid="review-score"` — contains both the numeric score (e.g.,
346
- `"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
347
- - `data-testid="rating-stars"` — star rating icons; count SVG children for
348
- star count.
349
- - Results are loaded asynchronously; 3s wait after `wait_for_load()` is
350
- required for all cards to render.
351
-
352
- ### Pagination
353
-
354
- ```python
355
- # Method 1: Next page button
356
- next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
357
- if next_btn:
358
- goto_url(next_btn)
359
- wait_for_load()
360
- wait(3)
361
-
362
- # Method 2: Offset parameter (25 results per page)
363
- current_url = page_info()["url"]
364
- offset = 25 # next page
365
- goto_url(current_url + f"&offset={offset}")
366
- wait_for_load()
367
- wait(3)
368
- ```
369
-
370
- ### Property / Hotel Page Extraction
371
-
372
- ```python
373
- detail = js("""
374
- ({
375
- name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
376
- || document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
377
- rating: document.querySelector('[data-testid="rating-squares"]')
378
- ? document.querySelectorAll('[data-testid="rating-squares"] svg').length
379
- : null,
380
- score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
381
- || document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
382
- address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
383
- || document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
384
- description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
385
- || document.querySelector('#property_description_content')?.innerText?.trim(),
386
- amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
387
- .map(e => e.innerText?.trim()).filter(Boolean),
388
- room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
389
- .map(el => ({
390
- name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
391
- price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
392
- })),
393
- lat: document.querySelector('a[href*="maps.google"]')
394
- ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
395
- lon: document.querySelector('a[href*="maps.google"]')
396
- ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
397
- })
398
- """)
399
- ```
400
-
401
- ### JSON-LD Schema (Property Pages)
402
-
403
- Property pages embed JSON-LD when fully rendered in browser. The schema type
404
- is `Hotel`:
405
-
406
- ```python
407
- ld_json = js("""
408
- (function() {
409
- for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
410
- try {
411
- var d = JSON.parse(s.textContent);
412
- if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
413
- } catch(e) {}
414
- }
415
- return null;
416
- })()
417
- """)
418
- # Returns:
419
- # {
420
- # "@type": "Hotel",
421
- # "name": "Hotel de Crillon",
422
- # "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
423
- # "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
424
- # "geo": {"latitude": 48.865, "longitude": 2.321},
425
- # "starRating": {"ratingValue": 5}
426
- # }
427
- ```
428
-
429
- JSON-LD is **not present in the WAF stub** — it only exists in the fully
430
- rendered page. `http_get` will never see it.
431
-
432
- ### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)
433
-
434
- Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
435
- legacy `b_hotel_data` globals. Access via:
436
-
437
- ```python
438
- next_data = js("window.__NEXT_DATA__") # dict or None
439
- b_hotel = js("window.b_hotel_data") # dict or None — legacy pages
440
- ```
441
-
442
- These globals are not present in the WAF stub and their availability depends
443
- on page version. Prefer data-testid selectors which are more stable.
444
-
445
- ---
446
-
447
- ## Pricing Extraction Patterns
448
-
449
- Booking.com shows prices per night with multiple formatting variants:
450
-
451
- ```python
452
- price_patterns = js("""
453
- ({
454
- // Search results card price
455
- search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
456
- // Property page room price
457
- room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
458
- // Original (crossed-out) price before discount
459
- original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
460
- || document.querySelector('.prco-valign-middle-helper del')?.innerText,
461
- // "Price for X nights" summary
462
- total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
463
- // Genius discount tag
464
- genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
465
- })
466
- """)
467
- ```
468
-
469
- **Price display nuances:**
470
- - Prices shown are **per night** by default; multiply by nights for total.
471
- - Currency is controlled by `selected_currency` URL param or user account
472
- setting.
473
- - Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
474
- or `"+ taxes & fees"` text adjacent to the price element.
475
- - The `data-testid="price-and-discounted-price"` element returns a single
476
- string that may contain both original and discounted price
477
- (e.g., `"US$400\nUS$320"`).
478
-
479
- ---
480
-
481
- ## WAF Detection & Handling in Browser
482
-
483
- The WAF resolves automatically in a real Chrome session. To detect if
484
- something went wrong:
485
-
486
- ```python
487
- def check_booking_waf():
488
- url = page_info()["url"]
489
- html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
490
- return (
491
- "chal_t=" in url
492
- or "AwsWafIntegration" in html_snippet
493
- or "challenge-container" in html_snippet
494
- )
495
-
496
- def wait_past_waf(timeout=15):
497
- import time
498
- deadline = time.time() + timeout
499
- while time.time() < deadline:
500
- if not check_booking_waf():
501
- return True
502
- wait(1)
503
- return False # timed out — WAF didn't resolve
504
-
505
- # Use after goto_url():
506
- goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
507
- wait_for_load()
508
- wait_past_waf()
509
- wait(2) # hydration
510
- ```
511
-
512
- ---
513
-
514
- ## Sitemap-Based URL Discovery Workflow
515
-
516
- Use this when you need a list of property URLs for a given country or city,
517
- without needing to scrape search results pages in the browser:
518
-
519
- ```python
520
- import gzip, re, urllib.request
521
-
522
- GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
523
-
524
- def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
525
- """Returns property page URLs for a country from sitemaps. No browser needed."""
526
- idx_url = f"https://www.booking.com/sitembk-hotel-index.xml"
527
- idx = http_get(idx_url, headers=GOOGLEBOT)
528
- pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{lang}\.\d+\.xml\.gz)</loc>'
529
- shards = re.findall(pattern, idx)[:max_shards]
530
-
531
- urls = []
532
- for shard_url in shards:
533
- req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
534
- with urllib.request.urlopen(req, timeout=60) as r:
535
- xml = gzip.decompress(r.read()).decode()
536
- all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
537
- # Filter by country code
538
- country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
539
- urls.extend(country_urls)
540
- return urls
541
-
542
- # Example: get French hotel URLs (no browser needed, instant)
543
- # french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
544
- # len(french_hotels) -> ~8,000+ URLs from one shard
545
- ```
546
-
547
- ---
548
-
549
- ## Gotchas
550
-
551
- - **WAF blocks everything via `http_get`** — there is no User-Agent or header
552
- combination that bypasses it. The challenge is cryptographic, not heuristic.
553
- - **WAF has two page sizes** — ~3,962 bytes (newer SDK, no AJAX reporter) and
554
- ~8,410 bytes (older with error reporting). Both are equally blocked.
555
- - **Sitemaps whitelist Googlebot UA** — `Googlebot/2.1` UA works for sitemap
556
- XML/GZ files but NOT for hotel/city/search HTML pages.
557
- - **GraphQL endpoint is unprotected** but useless without a valid Booking.com
558
- session (irene service requires authentication for all substantive queries).
559
- - **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by
560
- operation name restriction. Use field validation errors to probe the schema.
561
- - **GDPR consent banner**: shown after WAF resolves, before React renders
562
- search results. Must be dismissed (click `[data-testid="accept"]`) before
563
- interacting with EU sessions. Non-EU IPs may not see it.
564
- - **React hydration delay**: `wait_for_load()` fires before card data renders.
565
- Always add 2-3s of `wait()` after `wait_for_load()`.
566
- - **`sr-hotel` class is legacy** — Booking.com migrated to data-testid
567
- attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
568
- - **Price parsing**: the price element often contains the full string
569
- `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
570
- item for current price.
571
- - **Offset pagination cap**: Booking caps results at 1,000 properties per
572
- search (offset 0–975, rows=25). For cities with >1,000 properties, use
573
- filters (`nflt`) to segment results.
574
- - **Currency must be set via URL param**: `selected_currency=USD` in the search
575
- URL; the cookie-based currency selection may not persist across navigation.
576
- - **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
577
- London = `-2601889`. Negative integers indicate city-level destinations.
578
- Get the ID by reading it from the URL after using `ss=` search.
1
+ # Booking.com — Scraping & Data Extraction
2
+
3
+ Field-tested against booking.com on 2026-04-18 using `http_get` and the
4
+ `dml/graphql` JSON API. All tests run without a browser session.
5
+
6
+ ---
7
+
8
+ ## TL;DR
9
+
10
+ **`http_get` returns nothing useful from booking.com.** Every HTML page —
11
+ search results, hotel pages, city pages, the homepage — is intercepted by an
12
+ AWS WAF JS challenge before any content is served. The challenge requires
13
+ JavaScript execution to complete a cryptographic puzzle and set an
14
+ `aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.
15
+
16
+ **What you can do without a browser:**
17
+ - Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
18
+ - Read `robots.txt` for URL pattern documentation.
19
+ - Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
20
+ exploration (no auth = internal errors, but validation errors reveal the
21
+ schema).
22
+
23
+ **For all actual data extraction, use the browser (`goto` + `js`).**
24
+
25
+ ---
26
+
27
+ ## AWS WAF JS Challenge — What It Is
28
+
29
+ Every `http_get` request to `www.booking.com` receives one of two variants of
30
+ a WAF stub:
31
+
32
+ **Variant A (~3,962 bytes) — modern SDK:**
33
+ ```html
34
+ <script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
35
+ <script>
36
+ AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
37
+ </script>
38
+ ```
39
+
40
+ **Variant B (~8,410 bytes) — with AJAX error reporting:**
41
+ Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
42
+ `https://reports.booking.com/chal_report`. This variant is more common on
43
+ non-browser UA strings.
44
+
45
+ **Detection in your code:**
46
+ ```python
47
+ def is_waf_blocked(html: str) -> bool:
48
+     return (
49
+         'AwsWafIntegration' in html
50
+         or 'awsWafCookieDomainList' in html
51
+         or 'challenge.js' in html
52
+         or (len(html) < 10_000 and '<title></title>' in html)
53
+     )
54
+ ```
55
+
56
+ **What the challenge does:**
57
+ 1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
58
+ 2. Executes a cryptographic proof-of-work puzzle client-side.
59
+ 3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
60
+ 4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
61
+ appended.
62
+
63
+ This challenge **cannot be solved by `http_get`**. It requires a real JS
64
+ engine. A `bkng` session cookie is set on the first blocked response, but it
65
+ has no value without the WAF token.
66
+
67
+ **User agents tested — all blocked:**
68
+ - Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
69
+ - iPhone/Safari mobile
70
+ - `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
71
+ - Default `urllib` UA
72
+
73
+ ---
74
+
75
+ ## What `http_get` CAN Access
76
+
77
+ ### 1. XML Sitemaps (URL discovery)
78
+
79
+ Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
80
+ millions of property, city, region, and attraction URLs without a browser.
81
+
82
+ ```python
83
+ import gzip, re, urllib.request
84
+
85
+ GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
86
+
87
+ def fetch_sitemap_index(url: str) -> list[str]:
88
+     """Returns list of child sitemap URLs from an index sitemap."""
89
+     xml = http_get(url, headers=GOOGLEBOT)
90
+     return re.findall(r'<loc>(https://[^<]+)</loc>', xml)
91
+
92
+ def fetch_sitemap_gz(gz_url: str) -> list[str]:
93
+     """Decompresses a gzipped sitemap and returns all <loc> URLs."""
94
+     req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
95
+     with urllib.request.urlopen(req, timeout=30) as r:
96
+         data = gzip.decompress(r.read())
97
+     return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())
98
+
99
+ # Example: get all en-gb hotel URLs
100
+ hotel_idx = http_get(
101
+ "https://www.booking.com/sitembk-hotel-index.xml",
102
+ headers=GOOGLEBOT
103
+ )
104
+ # 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
105
+ en_gb_shards = re.findall(
106
+ r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
107
+ hotel_idx
108
+ )
109
+ # hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard
110
+ ```
111
+
112
+ **Available sitemap categories (confirmed, 275 total):**
113
+
114
+ | Index URL | Content |
115
+ |-----------|---------|
116
+ | `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
117
+ | `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
118
+ | `sitembk-region-index.xml` | Region landing pages |
119
+ | `sitembk-country-index.xml` | Country landing pages |
120
+ | `sitembk-attractions-index.xml` | Attractions |
121
+ | `sitembk-hotel-review-index.xml` | Review pages |
122
+ | `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
123
+
124
+ ### 2. `robots.txt`
125
+
126
+ ```python
127
+ robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
128
+ ```
129
+
130
+ - Returns immediately, no WAF.
131
+ - 136 Disallow entries, 275 Sitemap declarations.
132
+ - Documents all URL structures (search results, hotel pages, booking flow, etc.).
133
+
134
+ ### 3. GraphQL Schema Exploration (no auth)
135
+
136
+ The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
137
+ It accepts POST requests and returns JSON. Without a session, most queries
138
+ return `Internal Server Error` from the backend (`irene` service), but
139
+ **GraphQL validation errors fire before the backend** and reveal the schema.
140
+
141
+ ```python
142
+ import json, urllib.request, gzip
143
+
144
+ GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
145
+ GQL_HEADERS = {
146
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
147
+ "Accept": "application/json",
148
+ "Content-Type": "application/json",
149
+ "Origin": "https://www.booking.com",
150
+ "Referer": "https://www.booking.com/searchresults.html",
151
+ "x-booking-context-action-name": "searchresults",
152
+ "x-booking-context-aid": "376510",
153
+ "x-booking-site-type-id": "1",
154
+ }
155
+
156
+ def gql(operation_name: str, query: str, variables: dict = None) -> dict:
157
+     payload = {"operationName": operation_name, "query": query}
158
+     if variables:
159
+         payload["variables"] = variables
160
+     req = urllib.request.Request(
161
+         GQL_URL,
162
+         data=json.dumps(payload).encode(),
163
+         headers=GQL_HEADERS,
164
+         method="POST"
165
+     )
166
+     with urllib.request.urlopen(req, timeout=20) as r:
167
+         data = r.read()
168
+         if r.headers.get("Content-Encoding") == "gzip":
169
+             data = gzip.decompress(data)
170
+     return json.loads(data.decode())
171
+ ```
172
+
173
+ **Confirmed Query type fields (schema, field-tested 2026-04-18):**
174
+
175
+ | Field | Input type | Notes |
176
+ |-------|-----------|-------|
177
+ | `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
178
+ | `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
179
+ | `searchProperties` | `SearchInput!` | Returns 500 without auth session |
180
+ | `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
181
+ | `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |
182
+
183
+ **Important:** Booking.com GraphQL uses an **operation name whitelist** for
184
+ some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
185
+ following confirmed working names: `SearchResultsPage`, `SearchQuery`,
186
+ `HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.
187
+
188
+ **Operation names that bypass the whitelist restriction** (all return
189
+ `{ data: { __typename: 'Query' } }` with `{ __typename }`):
190
+ - `SearchResultsPage` ✓ (confirmed, use this)
191
+
192
+ **The search query structure** (known but returns 500 without session):
193
+ ```graphql
194
+ query SearchResultsPage($input: SearchQueryInput!) {
195
+ searchQueries {
196
+ search(input: $input) {
197
+ __typename # Returns SearchQueryResult type
198
+ }
199
+ }
200
+ }
201
+ ```
202
+
203
+ With `SearchQueryInput` fields (inferred from URL parameters, confirmed
204
+ accepted by validation):
205
+ ```json
206
+ {
207
+ "dest_id": "-1456928",
208
+ "dest_type": "CITY",
209
+ "checkin": "2026-05-01",
210
+ "checkout": "2026-05-03",
211
+ "group_adults": "2",
212
+ "no_rooms": "1",
213
+ "group_children": "0",
214
+ "selected_currency": "USD"
215
+ }
216
+ ```
217
+
218
+ ---
219
+
220
+ ## URL Parameter Reference
221
+
222
+ ### Search Results
223
+ `https://www.booking.com/searchresults.html`
224
+
225
+ | Parameter | Type | Example | Notes |
226
+ |-----------|------|---------|-------|
227
+ | `ss` | string | `Paris` | Free-text: city, hotel name, address |
228
+ | `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
229
+ | `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
230
+ | `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
231
+ | `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
232
+ | `group_adults` | int | `2` | |
233
+ | `no_rooms` | int | `1` | |
234
+ | `group_children` | int | `0` | |
235
+ | `age` | int (repeatable) | `5` | Child age; one per child |
236
+ | `selected_currency` | string | `USD` | ISO 4217 currency code |
237
+ | `lang` | string | `en-us` | BCP 47 locale |
238
+ | `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
239
+ | `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
240
+ | `offset` | int | `25` | Pagination (0-based, step 25) |
241
+ | `rows` | int | `25` | Results per page (max 25) |
242
+ | `map` | `1` | `1` | Map view mode |
243
+ | `src` | string | `searchresults` | Source context (cosmetic) |
244
+
245
+ **Common `nflt` filter codes:**
246
+ - `ht_id=204` — Hotels only
247
+ - `class=3;class=4;class=5` — Star rating
248
+ - `review_score=90` — Guest rating ≥ 9.0
249
+ - `fc=2` — Free cancellation
250
+ - `rm_types=…` — Room type
251
+ - `pri=1;pri=2` — Price tier (budget / mid / upscale)
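The parameters and `nflt` codes above can be combined programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_search_url` (it is not part of the original toolkit). It uses only the standard library, joins filters with semicolons as in the table, and lets `urlencode` percent-encode the result.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.booking.com/searchresults.html"

def build_search_url(ss, checkin, checkout, adults=2, rooms=1,
                     filters=None, offset=0, currency="USD"):
    """Assemble a search URL from the documented query parameters (hypothetical helper)."""
    params = [
        ("ss", ss),
        ("checkin", checkin),        # YYYY-MM-DD
        ("checkout", checkout),      # YYYY-MM-DD
        ("group_adults", adults),
        ("no_rooms", rooms),
        ("selected_currency", currency),
    ]
    if filters:
        # nflt is a single semicolon-separated string, e.g. "ht_id=204;class=4"
        params.append(("nflt", ";".join(filters)))
    if offset:
        params.append(("offset", offset))
    return SEARCH_BASE + "?" + urlencode(params)

url = build_search_url("Paris", "2026-05-01", "2026-05-03",
                       filters=["ht_id=204", "class=4"], offset=25)
```

The resulting URL is meant to be passed to the browser navigation helpers, since plain HTTP requests to this page are intercepted by the WAF.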
252
+
253
+ ### Property Pages
254
+ `https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`
255
+
256
+ Confirmed from sitemap (74 shards, ~3.5M properties):
257
+ ```
258
+ https://www.booking.com/hotel/{cc}/{slug}.html
259
+ https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
260
+ https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
261
+ ```
262
+ - `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
263
+ - `slug` = hotel name, lowercase, hyphen-separated
264
+ - Locale suffix optional; omit for default (English)
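When filtering sitemap output it can help to split a property URL back into its parts. This is a rough sketch under stated assumptions: `parse_property_url` is a hypothetical helper, the slug `example-hotel` is a placeholder, and the locale suffix is assumed to be either two letters or a two-plus-two form such as `en-gb`.

```python
import re

# Mirrors the three URL shapes listed above; the optional group is the locale suffix.
PROPERTY_URL_RE = re.compile(
    r"^https://www\.booking\.com/hotel/(?P<cc>[a-z]{2})/(?P<slug>[^/.]+)"
    r"(?:\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?))?\.html(?:[?#].*)?$"
)

def parse_property_url(url):
    """Return {'cc', 'slug', 'lang'} for a property URL, or None if it does not match."""
    m = PROPERTY_URL_RE.match(url)
    return m.groupdict() if m else None

with_locale = parse_property_url("https://www.booking.com/hotel/fr/example-hotel.en-gb.html")
without_locale = parse_property_url("https://www.booking.com/hotel/fr/example-hotel.html")
```

`lang` is `None` when the URL carries no locale suffix.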
265
+
266
+ ### City / Region / Country Pages
267
+ ```
268
+ https://www.booking.com/city/{cc}/{city-slug}.html
269
+ https://www.booking.com/region/{cc}/{region-slug}.html
270
+ https://www.booking.com/country/{cc}.html
271
+ ```
272
+
273
+ ---
274
+
275
+ ## Browser-Based Extraction (Required for All Data)
276
+
277
+ Since `http_get` is blocked, all actual data extraction requires the browser
278
+ (`goto` + `js`). The WAF challenge resolves automatically in Chrome.
279
+
280
+ ### Initial Navigation
281
+
282
+ ```python
283
+ # Always use new_tab() for the first Booking.com load in a session
284
+ tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
285
+ wait_for_load()
286
+ wait(3) # React hydration takes ~3s after readyState=complete
287
+
288
+ # Check for WAF challenge still running (rare in real Chrome)
289
+ url = page_info()["url"]
290
+ if "chal_t=" in url:
291
+     wait(5) # WAF challenge resolving
292
+     wait_for_load()
293
+ ```
294
+
295
+ ### GDPR / Cookie Consent Banner (EU Visitors)
296
+
297
+ Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
298
+ the WAF challenge resolves. It blocks interaction until dismissed.
299
+
300
+ ```python
301
+ def dismiss_cookie_banner():
302
+     # Booking.com uses data-testid="accept" on the Accept button
303
+     accepted = js("""
304
+     (function() {
305
+       var btn = document.querySelector('[data-testid="accept"]')
306
+         || document.querySelector('#onetrust-accept-btn-handler')
307
+         || document.querySelector('[aria-label*="Accept"]');
308
+       if (btn) { btn.click(); return true; }
309
+       return false;
310
+     })()
311
+     """)
312
+     return accepted
313
+
314
+ # Call immediately after load if you have an EU IP
315
+ if dismiss_cookie_banner():
316
+     wait(1)
317
+ ```
318
+
319
+ The consent banner does **not** appear in the WAF stub — it only renders after
320
+ the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
321
+ may not see it at all.
322
+
323
+ ### Search Results Page Extraction
324
+
325
+ ```python
326
+ results = js("""
327
+ Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
328
+ name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
329
+ url: el.querySelector('[data-testid="title-link"]')?.href,
330
+ price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
331
+ rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
332
+ stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
333
+ location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
334
+ availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
335
+ is_genius: !!el.querySelector('[data-testid="genius-label"]'),
336
+ }))
337
+ """)
338
+ ```
339
+
340
+ **Field notes:**
341
+ - `data-testid="property-card"` — confirmed selector for result cards (as of
342
+ 2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
343
+ - `data-testid="price-and-discounted-price"` — contains the nightly rate;
344
+ may show original + discounted price together as text.
345
+ - `data-testid="review-score"` — contains both the numeric score (e.g.,
346
+ `"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
347
+ - `data-testid="rating-stars"` — star rating icons; count SVG children for
348
+ star count.
349
+ - Results are loaded asynchronously; 3s wait after `wait_for_load()` is
350
+ required for all cards to render.
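The `review-score` note above suggests splitting on newlines. A small sketch of that post-processing on the Python side follows; `parse_review_score` is a hypothetical helper, and it assumes the first line carries the number and the second, when present, the label.

```python
import re

def parse_review_score(text):
    """Split review-score innerText into a numeric score and an optional label."""
    if not text:
        return None
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return None
    # Accept either "9.2" or a comma decimal such as "9,2" on the first line.
    m = re.search(r"\d+(?:[.,]\d+)?", lines[0])
    score = float(m.group().replace(",", ".")) if m else None
    label = lines[1] if len(lines) > 1 else None
    return {"score": score, "label": label}

parsed = parse_review_score("9.2\nSuperb")
```

Cards without a review score return `None`, so callers should treat the field as optional.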
351
+
352
+ ### Pagination
353
+
354
+ ```python
355
+ # Method 1: Next page button
356
+ next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
357
+ if next_btn:
358
+     goto_url(next_btn)
358
+     wait_for_load()
359
+     wait(3)
361
+
362
+ # Method 2: Offset parameter (25 results per page)
363
+ current_url = page_info()["url"]
364
+ offset = 25 # next page
365
+ goto_url(current_url + f"&offset={offset}")  # assumes current_url already has a query string and no existing offset param
366
+ wait_for_load()
367
+ wait(3)
368
+ ```
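Appending `&offset=` works only when the current URL already has a query string and no existing `offset`. A more defensive sketch, using only `urllib.parse`, is shown here; `with_offset` and `page_offsets` are hypothetical helpers, and the default of 1,000 results reflects the cap noted under Gotchas.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_offset(url: str, offset: int) -> str:
    """Return url with its offset parameter set (replacing any existing one)."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "offset"]
    query.append(("offset", str(offset)))
    return urlunsplit(parts._replace(query=urlencode(query)))

def page_offsets(max_results: int = 1000, step: int = 25) -> list[int]:
    """Offsets for every page up to the result cap: 0, 25, ..., 975 by default."""
    return list(range(0, max_results, step))

next_url = with_offset(
    "https://www.booking.com/searchresults.html?ss=Paris&offset=25", 50
)
```

Each offset URL would then be loaded with `goto_url`, followed by the same `wait_for_load()` and `wait(3)` sequence as above.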
369
+
370
+ ### Property / Hotel Page Extraction
371
+
372
+ ```python
373
+ detail = js("""
374
+ ({
375
+ name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
376
+ || document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
377
+ rating: document.querySelector('[data-testid="rating-squares"]')
378
+ ? document.querySelectorAll('[data-testid="rating-squares"] svg').length
379
+ : null,
380
+ score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
381
+ || document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
382
+ address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
383
+ || document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
384
+ description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
385
+ || document.querySelector('#property_description_content')?.innerText?.trim(),
386
+ amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
387
+ .map(e => e.innerText?.trim()).filter(Boolean),
388
+ room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
389
+ .map(el => ({
390
+ name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
391
+ price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
392
+ })),
393
+ lat: document.querySelector('a[href*="maps.google"]')
394
+ ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
395
+ lon: document.querySelector('a[href*="maps.google"]')
396
+ ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
397
+ })
398
+ """)
399
+ ```
400
+
401
+ ### JSON-LD Schema (Property Pages)
402
+
403
+ Property pages embed JSON-LD when fully rendered in browser. The schema type
404
+ is `Hotel`:
405
+
406
+ ```python
407
+ ld_json = js("""
408
+ (function() {
409
+ for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
410
+ try {
411
+ var d = JSON.parse(s.textContent);
412
+ if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
413
+ } catch(e) {}
414
+ }
415
+ return null;
416
+ })()
417
+ """)
418
+ # Returns:
419
+ # {
420
+ # "@type": "Hotel",
421
+ # "name": "Hotel de Crillon",
422
+ # "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
423
+ # "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
424
+ # "geo": {"latitude": 48.865, "longitude": 2.321},
425
+ # "starRating": {"ratingValue": 5}
426
+ # }
427
+ ```
428
+
429
+ JSON-LD is **not present in the WAF stub** — it only exists in the fully
430
+ rendered page. `http_get` will never see it.
431
+
432
+ ### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)
433
+
434
+ Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
435
+ legacy `b_hotel_data` globals. Access via:
436
+
437
+ ```python
438
+ next_data = js("window.__NEXT_DATA__") # dict or None
439
+ b_hotel = js("window.b_hotel_data") # dict or None — legacy pages
440
+ ```
441
+
442
+ These globals are not present in the WAF stub and their availability depends
443
+ on page version. Prefer data-testid selectors which are more stable.
444
+
445
+ ---
446
+
447
+ ## Pricing Extraction Patterns
448
+
449
+ Booking.com shows prices per night with multiple formatting variants:
450
+
451
+ ```python
452
+ price_patterns = js("""
453
+ ({
454
+ // Search results card price
455
+ search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
456
+ // Property page room price (same data-testid as the search card; meaningful only when run on a property page)
457
+ room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
458
+ // Original (crossed-out) price before discount
459
+ original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
460
+ || document.querySelector('.prco-valign-middle-helper del')?.innerText,
461
+ // "Price for X nights" summary
462
+ total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
463
+ // Genius discount tag
464
+ genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
465
+ })
466
+ """)
467
+ ```
468
+
469
+ **Price display nuances:**
470
+ - Prices shown are **per night** by default; multiply by nights for total.
471
+ - Currency is controlled by `selected_currency` URL param or user account
472
+ setting.
473
+ - Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
474
+ or `"+ taxes & fees"` text adjacent to the price element.
475
+ - The `data-testid="price-and-discounted-price"` element returns a single
476
+ string that may contain both original and discounted price
477
+ (e.g., `"US$400\nUS$320"`).
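The combined price string can be separated after extraction. The sketch below assumes a hypothetical helper `parse_price_text`, US-style number formatting (comma as thousands separator, dot as decimal), and that the last line is the current price as described in this document; other locales would need different handling.

```python
import re

def _amount(s):
    """Pull the first number out of a price string such as 'US$1,250'."""
    m = re.search(r"\d[\d,]*(?:\.\d+)?", s)
    return float(m.group().replace(",", "")) if m else None

def parse_price_text(text):
    """Split price innerText into current and (optional) original amounts."""
    if not text:
        return None
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return None
    current = _amount(lines[-1])                      # last line: price to pay
    original = _amount(lines[0]) if len(lines) > 1 else None  # first line: crossed-out price
    return {"current": current, "original": original}

discounted = parse_price_text("US$400\nUS$320")
```

The currency symbol is discarded here; keep the raw string alongside the parsed values if the currency matters.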
478
+
479
+ ---
480
+
481
+ ## WAF Detection & Handling in Browser
482
+
483
+ The WAF resolves automatically in a real Chrome session. To detect if
484
+ something went wrong:
485
+
486
+ ```python
487
+ def check_booking_waf():
488
+     url = page_info()["url"]
488
+     html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
489
+     return (
490
+         "chal_t=" in url
491
+         or "AwsWafIntegration" in html_snippet
492
+         or "challenge-container" in html_snippet
493
+     )
494
+
495
+ def wait_past_waf(timeout=15):
496
+     import time
497
+     deadline = time.time() + timeout
498
+     while time.time() < deadline:
499
+         if not check_booking_waf():
500
+             return True
501
+         wait(1)
502
+     return False # timed out — WAF didn't resolve
504
+
505
+ # Use after goto_url():
506
+ goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
507
+ wait_for_load()
508
+ wait_past_waf()
509
+ wait(2) # hydration
510
+ ```
511
+
512
+ ---
513
+
514
+ ## Sitemap-Based URL Discovery Workflow
515
+
516
+ Use this when you need a list of property URLs for a given country or city,
517
+ without needing to scrape search results pages in the browser:
518
+
519
+ ```python
520
+ import gzip, re, urllib.request
521
+
522
+ GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
523
+
524
+ def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
525
+     """Returns property page URLs for a country from sitemaps. No browser needed."""
526
+     idx_url = "https://www.booking.com/sitembk-hotel-index.xml"
527
+     idx = http_get(idx_url, headers=GOOGLEBOT)
528
+     pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{re.escape(lang)}\.\d+\.xml\.gz)</loc>'
529
+     shards = re.findall(pattern, idx)[:max_shards]
530
+
531
+     urls = []
532
+     for shard_url in shards:
533
+         req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
534
+         with urllib.request.urlopen(req, timeout=60) as r:
535
+             xml = gzip.decompress(r.read()).decode()
536
+         all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
537
+         # Filter by country code
538
+         country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
539
+         urls.extend(country_urls)
540
+     return urls
541
+
542
+ # Example: get French hotel URLs (no browser needed, instant)
543
+ # french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
544
+ # len(french_hotels) -> ~8,000+ URLs from one shard
545
+ ```
+
+ ---
+
+ ## Gotchas
+
+ - **WAF blocks every HTML page fetched via `http_get`**: no User-Agent or
+   header combination bypasses it. The challenge is cryptographic, not heuristic.
+ - **WAF has two page sizes**: ~3,962 bytes (newer SDK, no AJAX reporter) and
+   ~8,410 bytes (older SDK with error reporting). Both are equally blocked.
+ - **Sitemaps whitelist Googlebot UA**: the `Googlebot/2.1` UA works for sitemap
+   XML/GZ files but NOT for hotel/city/search HTML pages.
+ - **GraphQL endpoint is unprotected** but useless without a valid Booking.com
+   session (the irene service requires authentication for all substantive queries).
+ - **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by the
+   operation-name restriction. Use field validation errors to probe the schema.
+ - **GDPR consent banner**: shown after the WAF resolves, before React renders
+   search results. Dismiss it (click `[data-testid="accept"]`) before
+   interacting with the page in EU sessions. Non-EU IPs may not see it.
+ - **React hydration delay**: `wait_for_load()` fires before card data renders.
+   Always add 2-3s of `wait()` after `wait_for_load()`.
+ - **`sr-hotel` class is legacy**: Booking.com migrated to `data-testid`
+   attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
+ - **Price parsing**: the price element often contains the full string
+   `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
+   item for the current price.
+ - **Offset pagination cap**: Booking caps results at 1,000 properties per
+   search (offset 0–975, rows=25). For cities with >1,000 properties, use
+   filters (`nflt`) to segment results.
+ - **Currency must be set via URL param**: put `selected_currency=USD` in the
+   search URL; the cookie-based currency selection may not persist across
+   navigation.
+ - **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
+   London = `-2601889`. Negative integers indicate city-level destinations.
+   To find an ID, run an `ss=` search and read `dest_id` from the resulting URL.
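
The price-parsing, pagination-cap, and currency gotchas above can be sketched as three small pure helpers. This is a minimal illustration, not part of the shipped helpers: the function names `current_price`, `page_offsets`, and `search_url` are hypothetical, and the `offset` query parameter name is an assumption inferred from the pagination note. Only `ss` and `selected_currency` appear in the text above.

```python
from urllib.parse import urlencode

ROWS_PER_PAGE = 25   # page size quoted in the pagination gotcha
MAX_RESULTS = 1000   # per-search cap quoted in the pagination gotcha


def current_price(price_text: str) -> str:
    """Return the last non-empty line: the discounted price when two are shown."""
    lines = [ln.strip() for ln in price_text.split("\n") if ln.strip()]
    return lines[-1] if lines else ""


def page_offsets(total: int) -> list[int]:
    """Offsets reachable for a search with `total` matches, capped at 1,000 results."""
    return list(range(0, min(total, MAX_RESULTS), ROWS_PER_PAGE))


def search_url(city: str, offset: int = 0, currency: str = "USD") -> str:
    """Build a search URL that pins currency and page offset in the query string.

    Assumption: the pagination parameter is literally named `offset`.
    """
    params = {"ss": city, "selected_currency": currency, "offset": offset}
    return "https://www.booking.com/searchresults.html?" + urlencode(params)
```

Keeping these free of browser calls means they can be checked offline; for example, `current_price("US$400\nUS$320")` yields `"US$320"`, and `page_offsets(5000)` stops at offset 975.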