@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. /package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. /package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,578 +1,578 @@
1
- # Booking.com — Scraping & Data Extraction
2
-
3
- Field-tested against booking.com on 2026-04-18 using `http_get` and the
4
- `dml/graphql` JSON API. All tests run without a browser session.
5
-
6
- ---
7
-
8
- ## TL;DR
9
-
10
- **`http_get` returns nothing useful from booking.com.** Every HTML page —
11
- search results, hotel pages, city pages, the homepage — is intercepted by an
12
- AWS WAF JS challenge before any content is served. The challenge requires
13
- JavaScript execution to complete a cryptographic puzzle and set an
14
- `aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.
15
-
16
- **What you can do without a browser:**
17
- - Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
18
- - Read `robots.txt` for URL pattern documentation.
19
- - Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
20
- exploration (no auth = internal errors, but validation errors reveal the
21
- schema).
22
-
23
- **For all actual data extraction, use the browser (`goto` + `js`).**
24
-
25
- ---
26
-
27
- ## AWS WAF JS Challenge — What It Is
28
-
29
- Every `http_get` request to `www.booking.com` receives one of two variants of
30
- a WAF stub:
31
-
32
- **Variant A (~3,962 bytes) — modern SDK:**
33
- ```html
34
- <script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
35
- <script>
36
- AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
37
- </script>
38
- ```
39
-
40
- **Variant B (~8,410 bytes) — with AJAX error reporting:**
41
- Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
42
- `https://reports.booking.com/chal_report`. This variant is more common on
43
- non-browser UA strings.
44
-
45
- **Detection in your code:**
46
- ```python
47
- def is_waf_blocked(html: str) -> bool:
48
-     return (
49
-         'AwsWafIntegration' in html
50
-         or 'awsWafCookieDomainList' in html
51
-         or 'challenge.js' in html
52
-         or len(html) < 10_000 and '<title></title>' in html
53
-     )
54
- ```
55
-
56
- **What the challenge does:**
57
- 1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
58
- 2. Executes a cryptographic proof-of-work puzzle client-side.
59
- 3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
60
- 4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
61
- appended.
62
-
63
- This challenge **cannot be solved by `http_get`**. It requires a real JS
64
- engine. A `bkng` session cookie is set on the first blocked response, but it
65
- has no value without the WAF token.
66
-
67
- **User agents tested — all blocked:**
68
- - Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
69
- - iPhone/Safari mobile
70
- - `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
71
- - Default `urllib` UA
72
-
73
- ---
74
-
75
- ## What `http_get` CAN Access
76
-
77
- ### 1. XML Sitemaps (URL discovery)
78
-
79
- Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
80
- millions of property, city, region, and attraction URLs without a browser.
81
-
82
- ```python
83
- import gzip, re, urllib.request
84
-
85
- GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
86
-
87
- def fetch_sitemap_index(url: str) -> list[str]:
88
-     """Returns list of child sitemap URLs from an index sitemap."""
89
-     xml = http_get(url, headers=GOOGLEBOT)
90
-     return re.findall(r'<loc>(https://[^<]+)</loc>', xml)
91
-
92
- def fetch_sitemap_gz(gz_url: str) -> list[str]:
93
-     """Decompresses a gzipped sitemap and returns all <loc> URLs."""
94
-     req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
95
-     with urllib.request.urlopen(req, timeout=30) as r:
96
-         data = gzip.decompress(r.read())
97
-     return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())
98
-
99
- # Example: get all en-gb hotel URLs
100
- hotel_idx = http_get(
101
-     "https://www.booking.com/sitembk-hotel-index.xml",
102
-     headers=GOOGLEBOT
103
- )
104
- # 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
105
- en_gb_shards = re.findall(
106
-     r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
107
-     hotel_idx
108
- )
109
- # hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard
110
- ```
111
-
112
- **Available sitemap categories (confirmed, 275 total):**
113
-
114
- | Index URL | Content |
115
- |-----------|---------|
116
- | `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
117
- | `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
118
- | `sitembk-region-index.xml` | Region landing pages |
119
- | `sitembk-country-index.xml` | Country landing pages |
120
- | `sitembk-attractions-index.xml` | Attractions |
121
- | `sitembk-hotel-review-index.xml` | Review pages |
122
- | `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
123
-
124
- ### 2. `robots.txt`
125
-
126
- ```python
127
- robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
128
- ```
129
-
130
- - Returns immediately, no WAF.
131
- - 136 Disallow entries, 275 Sitemap declarations.
132
- - Documents all URL structures (search results, hotel pages, booking flow, etc.).
133
-
134
- ### 3. GraphQL Schema Exploration (no auth)
135
-
136
- The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
137
- It accepts POST requests and returns JSON. Without a session, most queries
138
- return `Internal Server Error` from the backend (`irene` service), but
139
- **GraphQL validation errors fire before the backend** and reveal the schema.
140
-
141
- ```python
142
- import json, urllib.request, gzip
143
-
144
- GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
145
- GQL_HEADERS = {
146
-     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
147
-     "Accept": "application/json",
148
-     "Content-Type": "application/json",
149
-     "Origin": "https://www.booking.com",
150
-     "Referer": "https://www.booking.com/searchresults.html",
151
-     "x-booking-context-action-name": "searchresults",
152
-     "x-booking-context-aid": "376510",
153
-     "x-booking-site-type-id": "1",
154
- }
155
-
156
- def gql(operation_name: str, query: str, variables: dict = None) -> dict:
157
-     payload = {"operationName": operation_name, "query": query}
158
-     if variables:
159
-         payload["variables"] = variables
160
-     req = urllib.request.Request(
161
-         GQL_URL,
162
-         data=json.dumps(payload).encode(),
163
-         headers=GQL_HEADERS,
164
-         method="POST"
165
-     )
166
-     with urllib.request.urlopen(req, timeout=20) as r:
167
-         data = r.read()
168
-         if r.headers.get("Content-Encoding") == "gzip":
169
-             data = gzip.decompress(data)
170
-     return json.loads(data.decode())
171
- ```
172
-
173
- **Confirmed Query type fields (schema, field-tested 2026-04-18):**
174
-
175
- | Field | Input type | Notes |
176
- |-------|-----------|-------|
177
- | `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
178
- | `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
179
- | `searchProperties` | `SearchInput!` | Returns 500 without auth session |
180
- | `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
181
- | `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |
182
-
183
- **Important:** Booking.com GraphQL uses an **operation name whitelist** for
184
- some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
185
- following confirmed working names: `SearchResultsPage`, `SearchQuery`,
186
- `HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.
187
-
188
- **Operation names that bypass the whitelist restriction** (all return
189
- `{ data: { __typename: 'Query' } }` with `{ __typename }`):
190
- - `SearchResultsPage` ✓ (confirmed, use this)
191
-
192
- **The search query structure** (known but returns 500 without session):
193
- ```graphql
194
- query SearchResultsPage($input: SearchQueryInput!) {
195
-   searchQueries {
196
-     search(input: $input) {
197
-       __typename # Returns SearchQueryResult type
198
-     }
199
-   }
200
- }
201
- ```
202
-
203
- With `SearchQueryInput` fields (inferred from URL parameters, confirmed
204
- accepted by validation):
205
- ```json
206
- {
207
-   "dest_id": "-1456928",
208
-   "dest_type": "CITY",
209
-   "checkin": "2026-05-01",
210
-   "checkout": "2026-05-03",
211
-   "group_adults": "2",
212
-   "no_rooms": "1",
213
-   "group_children": "0",
214
-   "selected_currency": "USD"
215
- }
216
- ```
217
-
218
- ---
219
-
220
- ## URL Parameter Reference
221
-
222
- ### Search Results
223
- `https://www.booking.com/searchresults.html`
224
-
225
- | Parameter | Type | Example | Notes |
226
- |-----------|------|---------|-------|
227
- | `ss` | string | `Paris` | Free-text: city, hotel name, address |
228
- | `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
229
- | `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
230
- | `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
231
- | `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
232
- | `group_adults` | int | `2` | |
233
- | `no_rooms` | int | `1` | |
234
- | `group_children` | int | `0` | |
235
- | `age` | int (repeatable) | `5` | Child age; one per child |
236
- | `selected_currency` | string | `USD` | ISO 4217 currency code |
237
- | `lang` | string | `en-us` | BCP 47 locale |
238
- | `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
239
- | `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
240
- | `offset` | int | `25` | Pagination (0-based, step 25) |
241
- | `rows` | int | `25` | Results per page (max 25) |
242
- | `map` | `1` | `1` | Map view mode |
243
- | `src` | string | `searchresults` | Source context (cosmetic) |
244
-
245
- **Common `nflt` filter codes:**
246
- - `ht_id=204` — Hotels only
247
- - `class=3;class=4;class=5` — Star rating
248
- - `review_score=90` — Guest rating ≥ 9.0
249
- - `fc=2` — Free cancellation
250
- - `rm_types=…` — Room type
251
- - `pri=1;pri=2` — Price tier (budget / mid / upscale)
252
-
253
- ### Property Pages
254
- `https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`
255
-
256
- Confirmed from sitemap (74 shards, ~3.5M properties):
257
- ```
258
- https://www.booking.com/hotel/{cc}/{slug}.html
259
- https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
260
- https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
261
- ```
262
- - `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
263
- - `slug` = hotel name, lowercase, hyphen-separated
264
- - Locale suffix optional; omit for default (English)
265
-
266
- ### City / Region / Country Pages
267
- ```
268
- https://www.booking.com/city/{cc}/{city-slug}.html
269
- https://www.booking.com/region/{cc}/{region-slug}.html
270
- https://www.booking.com/country/{cc}.html
271
- ```
272
-
273
- ---
274
-
275
- ## Browser-Based Extraction (Required for All Data)
276
-
277
- Since `http_get` is blocked, all actual data extraction requires the browser
278
- (`goto` + `js`). The WAF challenge resolves automatically in Chrome.
279
-
280
- ### Initial Navigation
281
-
282
- ```python
283
- # Always use new_tab() for the first Booking.com load in a session
284
- tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
285
- wait_for_load()
286
- wait(3) # React hydration takes ~3s after readyState=complete
287
-
288
- # Check for WAF challenge still running (rare in real Chrome)
289
- url = page_info()["url"]
290
- if "chal_t=" in url:
291
-     wait(5) # WAF challenge resolving
292
-     wait_for_load()
293
- ```
294
-
295
- ### GDPR / Cookie Consent Banner (EU Visitors)
296
-
297
- Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
298
- the WAF challenge resolves. It blocks interaction until dismissed.
299
-
300
- ```python
301
- def dismiss_cookie_banner():
302
- # Booking.com uses data-testid="accept" on the Accept button
303
- accepted = js("""
304
- (function() {
305
- var btn = document.querySelector('[data-testid="accept"]')
306
- || document.querySelector('#onetrust-accept-btn-handler')
307
- || document.querySelector('[aria-label*="Accept"]');
308
- if (btn) { btn.click(); return true; }
309
- return false;
310
- })()
311
- """)
312
- return accepted
313
-
314
- # Call immediately after load if you have an EU IP
315
- if dismiss_cookie_banner():
316
- wait(1)
317
- ```
318
-
319
- The consent banner does **not** appear in the WAF stub — it only renders after
320
- the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
321
- may not see it at all.
322
-
323
- ### Search Results Page Extraction
324
-
325
- ```python
326
- results = js("""
327
- Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
328
- name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
329
- url: el.querySelector('[data-testid="title-link"]')?.href,
330
- price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
331
- rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
332
- stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
333
- location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
334
- availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
335
- is_genius: !!el.querySelector('[data-testid="genius-label"]'),
336
- }))
337
- """)
338
- ```
339
-
340
- **Field notes:**
341
- - `data-testid="property-card"` — confirmed selector for result cards (as of
342
- 2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
343
- - `data-testid="price-and-discounted-price"` — contains the nightly rate;
344
- may show original + discounted price together as text.
345
- - `data-testid="review-score"` — contains both the numeric score (e.g.,
346
- `"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
347
- - `data-testid="rating-stars"` — star rating icons; count SVG children for
348
- star count.
349
- - Results are loaded asynchronously; 3s wait after `wait_for_load()` is
350
- required for all cards to render.
351
-
352
- ### Pagination
353
-
354
- ```python
355
- # Method 1: Next page button
356
- next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
357
- if next_btn:
358
- goto_url(next_btn)
359
- wait_for_load()
360
- wait(3)
361
-
362
- # Method 2: Offset parameter (25 results per page)
363
- current_url = page_info()["url"]
364
- offset = 25 # next page
365
- goto_url(current_url + f"&offset={offset}")
366
- wait_for_load()
367
- wait(3)
368
- ```
369
-
370
- ### Property / Hotel Page Extraction
371
-
372
- ```python
373
- detail = js("""
374
- ({
375
- name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
376
- || document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
377
- rating: document.querySelector('[data-testid="rating-squares"]')
378
- ? document.querySelectorAll('[data-testid="rating-squares"] svg').length
379
- : null,
380
- score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
381
- || document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
382
- address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
383
- || document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
384
- description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
385
- || document.querySelector('#property_description_content')?.innerText?.trim(),
386
- amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
387
- .map(e => e.innerText?.trim()).filter(Boolean),
388
- room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
389
- .map(el => ({
390
- name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
391
- price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
392
- })),
393
- lat: document.querySelector('a[href*="maps.google"]')
394
- ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
395
- lon: document.querySelector('a[href*="maps.google"]')
396
- ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
397
- })
398
- """)
399
- ```
400
-
401
- ### JSON-LD Schema (Property Pages)
402
-
403
- Property pages embed JSON-LD when fully rendered in browser. The schema type
404
- is `Hotel`:
405
-
406
- ```python
407
- ld_json = js("""
408
- (function() {
409
- for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
410
- try {
411
- var d = JSON.parse(s.textContent);
412
- if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
413
- } catch(e) {}
414
- }
415
- return null;
416
- })()
417
- """)
418
- # Returns:
419
- # {
420
- # "@type": "Hotel",
421
- # "name": "Hotel de Crillon",
422
- # "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
423
- # "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
424
- # "geo": {"latitude": 48.865, "longitude": 2.321},
425
- # "starRating": {"ratingValue": 5}
426
- # }
427
- ```
428
-
429
- JSON-LD is **not present in the WAF stub** — it only exists in the fully
430
- rendered page. `http_get` will never see it.
431
-
432
- ### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)
433
-
434
- Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
435
- legacy `b_hotel_data` globals. Access via:
436
-
437
- ```python
438
- next_data = js("window.__NEXT_DATA__") # dict or None
439
- b_hotel = js("window.b_hotel_data") # dict or None — legacy pages
440
- ```
441
-
442
- These globals are not present in the WAF stub and their availability depends
443
- on page version. Prefer data-testid selectors which are more stable.
444
-
445
- ---
446
-
447
- ## Pricing Extraction Patterns
448
-
449
- Booking.com shows prices per night with multiple formatting variants:
450
-
451
- ```python
452
- price_patterns = js("""
453
- ({
454
- // Search results card price
455
- search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
456
- // Property page room price
457
- room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
458
- // Original (crossed-out) price before discount
459
- original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
460
- || document.querySelector('.prco-valign-middle-helper del')?.innerText,
461
- // "Price for X nights" summary
462
- total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
463
- // Genius discount tag
464
- genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
465
- })
466
- """)
467
- ```
468
-
469
- **Price display nuances:**
470
- - Prices shown are **per night** by default; multiply by nights for total.
471
- - Currency is controlled by `selected_currency` URL param or user account
472
- setting.
473
- - Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
474
- or `"+ taxes & fees"` text adjacent to the price element.
475
- - The `data-testid="price-and-discounted-price"` element returns a single
476
- string that may contain both original and discounted price
477
- (e.g., `"US$400\nUS$320"`).
478
-
479
- ---
480
-
481
- ## WAF Detection & Handling in Browser
482
-
483
- The WAF resolves automatically in a real Chrome session. To detect if
484
- something went wrong:
485
-
486
- ```python
487
- def check_booking_waf():
488
- url = page_info()["url"]
489
- html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
490
- return (
491
- "chal_t=" in url
492
- or "AwsWafIntegration" in html_snippet
493
- or "challenge-container" in html_snippet
494
- )
495
-
496
- def wait_past_waf(timeout=15):
497
- import time
498
- deadline = time.time() + timeout
499
- while time.time() < deadline:
500
- if not check_booking_waf():
501
- return True
502
- wait(1)
503
- return False # timed out — WAF didn't resolve
504
-
505
- # Use after goto_url():
506
- goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
507
- wait_for_load()
508
- wait_past_waf()
509
- wait(2) # hydration
510
- ```
511
-
512
- ---
513
-
514
- ## Sitemap-Based URL Discovery Workflow
515
-
516
- Use this when you need a list of property URLs for a given country or city,
517
- without needing to scrape search results pages in the browser:
518
-
519
- ```python
520
- import gzip, re, urllib.request
521
-
522
- GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
523
-
524
- def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
525
- """Returns property page URLs for a country from sitemaps. No browser needed."""
526
- idx_url = f"https://www.booking.com/sitembk-hotel-index.xml"
527
- idx = http_get(idx_url, headers=GOOGLEBOT)
528
- pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{lang}\.\d+\.xml\.gz)</loc>'
529
- shards = re.findall(pattern, idx)[:max_shards]
530
-
531
- urls = []
532
- for shard_url in shards:
533
- req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
534
- with urllib.request.urlopen(req, timeout=60) as r:
535
- xml = gzip.decompress(r.read()).decode()
536
- all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
537
- # Filter by country code
538
- country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
539
- urls.extend(country_urls)
540
- return urls
541
-
542
- # Example: get French hotel URLs (no browser needed, instant)
543
- # french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
544
- # len(french_hotels) -> ~8,000+ URLs from one shard
545
- ```
546
-
547
- ---
548
-
549
- ## Gotchas
550
-
551
- - **WAF blocks everything via `http_get`** — there is no User-Agent or header
552
- combination that bypasses it. The challenge is cryptographic, not heuristic.
553
- - **WAF has two page sizes** — ~3,962 bytes (newer SDK, no AJAX reporter) and
554
- ~8,410 bytes (older with error reporting). Both are equally blocked.
555
- - **Sitemaps whitelist Googlebot UA** — `Googlebot/2.1` UA works for sitemap
556
- XML/GZ files but NOT for hotel/city/search HTML pages.
557
- - **GraphQL endpoint is unprotected** but useless without a valid Booking.com
558
- session (irene service requires authentication for all substantive queries).
559
- - **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by
560
- operation name restriction. Use field validation errors to probe the schema.
561
- - **GDPR consent banner**: shown after WAF resolves, before React renders
562
- search results. Must be dismissed (click `[data-testid="accept"]`) before
563
- interacting with EU sessions. Non-EU IPs may not see it.
564
- - **React hydration delay**: `wait_for_load()` fires before card data renders.
565
- Always add 2-3s of `wait()` after `wait_for_load()`.
566
- - **`sr-hotel` class is legacy** — Booking.com migrated to data-testid
567
- attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
568
- - **Price parsing**: the price element often contains the full string
569
- `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
570
- item for current price.
571
- - **Offset pagination cap**: Booking caps results at 1,000 properties per
572
- search (offset 0–975, rows=25). For cities with >1,000 properties, use
573
- filters (`nflt`) to segment results.
574
- - **Currency must be set via URL param**: `selected_currency=USD` in the search
575
- URL; the cookie-based currency selection may not persist across navigation.
576
- - **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
577
- London = `-2601889`. Negative integers indicate city-level destinations.
578
- Get the ID by reading it from the URL after using `ss=` search.
1
+ # Booking.com — Scraping & Data Extraction
2
+
3
+ Field-tested against booking.com on 2026-04-18 using `http_get` and the
4
+ `dml/graphql` JSON API. All tests were run without a browser session.
5
+
6
+ ---
7
+
8
+ ## TL;DR
9
+
10
+ **`http_get` returns nothing useful from booking.com.** Every HTML page —
11
+ search results, hotel pages, city pages, the homepage — is intercepted by an
12
+ AWS WAF JS challenge before any content is served. The challenge requires
13
+ JavaScript execution to complete a cryptographic puzzle and set an
14
+ `aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.
15
+
16
+ **What you can do without a browser:**
17
+ - Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
18
+ - Read `robots.txt` for URL pattern documentation.
19
+ - Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
20
+ exploration (no auth = internal errors, but validation errors reveal the
21
+ schema).
22
+
23
+ **For all actual data extraction, use the browser (`goto` + `js`).**
24
+
25
+ ---
26
+
27
+ ## AWS WAF JS Challenge — What It Is
28
+
29
+ Every `http_get` request to `www.booking.com` receives one of two variants of
30
+ a WAF stub:
31
+
32
+ **Variant A (~3,962 bytes) — modern SDK:**
33
+ ```html
34
+ <script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
35
+ <script>
36
+ AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
37
+ </script>
38
+ ```
39
+
40
+ **Variant B (~8,410 bytes) — with AJAX error reporting:**
41
+ Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
42
+ `https://reports.booking.com/chal_report`. This variant is more common on
43
+ non-browser UA strings.
44
+
45
+ **Detection in your code:**
46
+ ```python
47
+ def is_waf_blocked(html: str) -> bool:
48
+ return (
49
+ 'AwsWafIntegration' in html
50
+ or 'awsWafCookieDomainList' in html
51
+ or 'challenge.js' in html
52
+         or (len(html) < 10_000 and '<title></title>' in html)
53
+ )
54
+ ```
55
+
56
+ **What the challenge does:**
57
+ 1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
58
+ 2. Executes a cryptographic proof-of-work puzzle client-side.
59
+ 3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
60
+ 4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
61
+ appended.
62
+
63
+ This challenge **cannot be solved by `http_get`**. It requires a real JS
64
+ engine. A `bkng` session cookie is set on the first blocked response, but it
65
+ has no value without the WAF token.
66
+
67
+ **User agents tested — all blocked:**
68
+ - Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
69
+ - iPhone/Safari mobile
70
+ - `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
71
+ - Default `urllib` UA
72
+
73
+ ---
74
+
75
+ ## What `http_get` CAN Access
76
+
77
+ ### 1. XML Sitemaps (URL discovery)
78
+
79
+ Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
80
+ millions of property, city, region, and attraction URLs without a browser.
81
+
82
+ ```python
83
+ import gzip, re, urllib.request
84
+
85
+ GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
86
+
87
+ def fetch_sitemap_index(url: str) -> list[str]:
88
+ """Returns list of child sitemap URLs from an index sitemap."""
89
+ xml = http_get(url, headers=GOOGLEBOT)
90
+ return re.findall(r'<loc>(https://[^<]+)</loc>', xml)
91
+
92
+ def fetch_sitemap_gz(gz_url: str) -> list[str]:
93
+ """Decompresses a gzipped sitemap and returns all <loc> URLs."""
94
+ req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
95
+ with urllib.request.urlopen(req, timeout=30) as r:
96
+ data = gzip.decompress(r.read())
97
+ return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())
98
+
99
+ # Example: get all en-gb hotel URLs
100
+ hotel_idx = http_get(
101
+ "https://www.booking.com/sitembk-hotel-index.xml",
102
+ headers=GOOGLEBOT
103
+ )
104
+ # 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
105
+ en_gb_shards = re.findall(
106
+ r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
107
+ hotel_idx
108
+ )
109
+ # hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard
110
+ ```
111
+
112
+ **Available sitemap categories (confirmed, 275 total):**
113
+
114
+ | Index URL | Content |
115
+ |-----------|---------|
116
+ | `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
117
+ | `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
118
+ | `sitembk-region-index.xml` | Region landing pages |
119
+ | `sitembk-country-index.xml` | Country landing pages |
120
+ | `sitembk-attractions-index.xml` | Attractions |
121
+ | `sitembk-hotel-review-index.xml` | Review pages |
122
+ | `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
123
+
124
+ ### 2. `robots.txt`
125
+
126
+ ```python
127
+ robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
128
+ ```
129
+
130
+ - Returns immediately, no WAF.
131
+ - 136 Disallow entries, 275 Sitemap declarations.
132
+ - Documents all URL structures (search results, hotel pages, booking flow, etc.).
133
+
134
+ ### 3. GraphQL Schema Exploration (no auth)
135
+
136
+ The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
137
+ It accepts POST requests and returns JSON. Without a session, most queries
138
+ return `Internal Server Error` from the backend (`irene` service), but
139
+ **GraphQL validation errors fire before the backend** and reveal the schema.
140
+
141
+ ```python
142
+ import json, urllib.request, gzip
143
+
144
+ GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
145
+ GQL_HEADERS = {
146
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
147
+ "Accept": "application/json",
148
+ "Content-Type": "application/json",
149
+ "Origin": "https://www.booking.com",
150
+ "Referer": "https://www.booking.com/searchresults.html",
151
+ "x-booking-context-action-name": "searchresults",
152
+ "x-booking-context-aid": "376510",
153
+ "x-booking-site-type-id": "1",
154
+ }
155
+
156
+ def gql(operation_name: str, query: str, variables: dict = None) -> dict:
157
+ payload = {"operationName": operation_name, "query": query}
158
+ if variables:
159
+ payload["variables"] = variables
160
+ req = urllib.request.Request(
161
+ GQL_URL,
162
+ data=json.dumps(payload).encode(),
163
+ headers=GQL_HEADERS,
164
+ method="POST"
165
+ )
166
+ with urllib.request.urlopen(req, timeout=20) as r:
167
+ data = r.read()
168
+ if r.headers.get("Content-Encoding") == "gzip":
169
+ data = gzip.decompress(data)
170
+ return json.loads(data.decode())
171
+ ```
172
+
173
+ **Confirmed Query type fields (schema, field-tested 2026-04-18):**
174
+
175
+ | Field | Input type | Notes |
176
+ |-------|-----------|-------|
177
+ | `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
178
+ | `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
179
+ | `searchProperties` | `SearchInput!` | Returns 500 without auth session |
180
+ | `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
181
+ | `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |
182
+
183
+ **Important:** Booking.com GraphQL uses an **operation name whitelist** for
184
+ some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
185
+ following confirmed working names: `SearchResultsPage`, `SearchQuery`,
186
+ `HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.
187
+
188
+ **Operation names confirmed to pass the whitelist** (each returns
189
+ `{ data: { __typename: 'Query' } }` for the query `{ __typename }`):
190
+ - `SearchResultsPage` ✓ (confirmed, use this)
191
+
192
+ **The search query structure** (known but returns 500 without session):
193
+ ```graphql
194
+ query SearchResultsPage($input: SearchQueryInput!) {
195
+ searchQueries {
196
+ search(input: $input) {
197
+ __typename # Returns SearchQueryResult type
198
+ }
199
+ }
200
+ }
201
+ ```
202
+
203
+ With `SearchQueryInput` fields (inferred from URL parameters, confirmed
204
+ accepted by validation):
205
+ ```json
206
+ {
207
+ "dest_id": "-1456928",
208
+ "dest_type": "CITY",
209
+ "checkin": "2026-05-01",
210
+ "checkout": "2026-05-03",
211
+ "group_adults": "2",
212
+ "no_rooms": "1",
213
+ "group_children": "0",
214
+ "selected_currency": "USD"
215
+ }
216
+ ```
217
+
218
+ ---
219
+
220
+ ## URL Parameter Reference
221
+
222
+ ### Search Results
223
+ `https://www.booking.com/searchresults.html`
224
+
225
+ | Parameter | Type | Example | Notes |
226
+ |-----------|------|---------|-------|
227
+ | `ss` | string | `Paris` | Free-text: city, hotel name, address |
228
+ | `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
229
+ | `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
230
+ | `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
231
+ | `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
232
+ | `group_adults` | int | `2` | |
233
+ | `no_rooms` | int | `1` | |
234
+ | `group_children` | int | `0` | |
235
+ | `age` | int (repeatable) | `5` | Child age; one per child |
236
+ | `selected_currency` | string | `USD` | ISO 4217 currency code |
237
+ | `lang` | string | `en-us` | BCP 47 locale |
238
+ | `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
239
+ | `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
240
+ | `offset` | int | `25` | Pagination (0-based, step 25) |
241
+ | `rows` | int | `25` | Results per page (max 25) |
242
+ | `map` | `1` | `1` | Map view mode |
243
+ | `src` | string | `searchresults` | Source context (cosmetic) |
244
+
245
+ **Common `nflt` filter codes:**
246
+ - `ht_id=204` — Hotels only
247
+ - `class=3;class=4;class=5` — Star rating
248
+ - `review_score=90` — Guest rating ≥ 9.0
249
+ - `fc=2` — Free cancellation
250
+ - `rm_types=…` — Room type
251
+ - `pri=1;pri=2` — Price tier (budget / mid / upscale)
252
+
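The parameter table and `nflt` codes above can be combined into a search URL. The following is a minimal sketch, not part of the package: `build_search_url` and `SEARCH_BASE` are hypothetical names, and only a subset of the documented parameters is covered.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.booking.com/searchresults.html"

def build_search_url(ss, checkin, checkout, adults=2, rooms=1,
                     currency="USD", nflt=None, offset=0):
    """Assemble a search URL from the documented parameters (hypothetical helper)."""
    params = {
        "ss": ss,
        "checkin": checkin,      # YYYY-MM-DD
        "checkout": checkout,    # YYYY-MM-DD
        "group_adults": adults,
        "no_rooms": rooms,
        "selected_currency": currency,
    }
    if nflt:
        # Filters are semicolon-separated, e.g. "ht_id=204;class=4".
        params["nflt"] = ";".join(nflt)
    if offset:
        params["offset"] = offset
    # urlencode percent-encodes "=" and ";" inside the nflt value.
    return f"{SEARCH_BASE}?{urlencode(params)}"
```

For example, `build_search_url("Paris", "2026-05-01", "2026-05-03", nflt=["ht_id=204", "class=4"])` yields a URL whose `nflt` value is percent-encoded; the result is meant to be passed to the browser helpers, since `http_get` is blocked for this page.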
253
+ ### Property Pages
254
+ `https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`
255
+
256
+ Confirmed from sitemap (74 shards, ~3.5M properties):
257
+ ```
258
+ https://www.booking.com/hotel/{cc}/{slug}.html
259
+ https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
260
+ https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
261
+ ```
262
+ - `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
263
+ - `slug` = hotel name, lowercase, hyphen-separated
264
+ - Locale suffix optional; omit for default (English)
265
+
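When filtering sitemap output, it can help to split a property URL back into its parts. This is a sketch under the assumptions that slugs contain no dots and that locale suffixes look like `en-gb` or `de`; `HOTEL_URL_RE` and `parse_hotel_url` are hypothetical names, and the slug in the usage note is a made-up example.

```python
import re

# Matches /hotel/{cc}/{slug}.html with an optional .{lang} suffix before .html.
HOTEL_URL_RE = re.compile(
    r"^https://www\.booking\.com/hotel/(?P<cc>[a-z]{2})/(?P<slug>[^/.]+)"
    r"(?:\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?))?\.html"
)

def parse_hotel_url(url):
    """Return {'cc', 'slug', 'lang'} for a property URL, or None if it does not match."""
    m = HOTEL_URL_RE.match(url)
    return m.groupdict() if m else None
```

With this pattern, `parse_hotel_url("https://www.booking.com/hotel/fr/example-hotel.en-gb.html")` gives `cc="fr"`, `slug="example-hotel"`, `lang="en-gb"`, and the `lang` entry is `None` when the suffix is omitted.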
266
+ ### City / Region / Country Pages
267
+ ```
268
+ https://www.booking.com/city/{cc}/{city-slug}.html
269
+ https://www.booking.com/region/{cc}/{region-slug}.html
270
+ https://www.booking.com/country/{cc}.html
271
+ ```
272
+
273
+ ---
274
+
275
+ ## Browser-Based Extraction (Required for All Data)
276
+
277
+ Since `http_get` is blocked, all actual data extraction requires the browser
278
+ (`goto` + `js`). The WAF challenge resolves automatically in Chrome.
279
+
280
+ ### Initial Navigation
281
+
282
+ ```python
283
+ # Always use new_tab() for the first Booking.com load in a session
284
+ tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
285
+ wait_for_load()
286
+ wait(3) # React hydration takes ~3s after readyState=complete
287
+
288
+ # Check for WAF challenge still running (rare in real Chrome)
289
+ url = page_info()["url"]
290
+ if "chal_t=" in url:
291
+ wait(5) # WAF challenge resolving
292
+ wait_for_load()
293
+ ```
294
+
295
+ ### GDPR / Cookie Consent Banner (EU Visitors)
296
+
297
+ Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
298
+ the WAF challenge resolves. It blocks interaction until dismissed.
299
+
300
+ ```python
301
+ def dismiss_cookie_banner():
302
+ # Booking.com uses data-testid="accept" on the Accept button
303
+ accepted = js("""
304
+ (function() {
305
+ var btn = document.querySelector('[data-testid="accept"]')
306
+ || document.querySelector('#onetrust-accept-btn-handler')
307
+ || document.querySelector('[aria-label*="Accept"]');
308
+ if (btn) { btn.click(); return true; }
309
+ return false;
310
+ })()
311
+ """)
312
+ return accepted
313
+
314
+ # Call immediately after load if you have an EU IP
315
+ if dismiss_cookie_banner():
316
+ wait(1)
317
+ ```
318
+
319
+ The consent banner does **not** appear in the WAF stub — it only renders after
320
+ the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
321
+ may not see it at all.
322
+
323
+ ### Search Results Page Extraction
324
+
325
+ ```python
326
+ results = js("""
327
+ Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
328
+ name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
329
+ url: el.querySelector('[data-testid="title-link"]')?.href,
330
+ price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
331
+ rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
332
+ stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
333
+ location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
334
+ availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
335
+ is_genius: !!el.querySelector('[data-testid="genius-label"]'),
336
+ }))
337
+ """)
338
+ ```
339
+
340
+ **Field notes:**
341
+ - `data-testid="property-card"` — confirmed selector for result cards (as of
342
+ 2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
343
+ - `data-testid="price-and-discounted-price"` — contains the nightly rate;
344
+ may show original + discounted price together as text.
345
+ - `data-testid="review-score"` — contains both the numeric score (e.g.,
346
+ `"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
347
+ - `data-testid="rating-stars"` — star rating icons; count SVG children for
348
+ star count.
349
+ - Results are loaded asynchronously; 3s wait after `wait_for_load()` is
350
+ required for all cards to render.
351
+
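The `review-score` note above can be applied on the Python side once `results` has been returned. This is a small sketch that assumes the text has the score on the first line and the label on the second; `parse_review_score` is a hypothetical helper, and real card text may carry extra lines that need different handling.

```python
def parse_review_score(text):
    """Return (score, label) from review-score text; either part may be None."""
    lines = [ln.strip() for ln in (text or "").split("\n") if ln.strip()]
    if not lines:
        return None, None
    try:
        # Some locales use a decimal comma, so normalise it before converting.
        score = float(lines[0].replace(",", "."))
    except ValueError:
        score = None
    label = lines[1] if len(lines) > 1 else None
    return score, label
```

It can be mapped over the extracted cards, for example `[parse_review_score(r.get("rating")) for r in results]`.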
352
+ ### Pagination
353
+
354
+ ```python
355
+ # Method 1: Next page button
356
+ next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
357
+ if next_btn:
358
+ goto_url(next_btn)
359
+ wait_for_load()
360
+ wait(3)
361
+
362
+ # Method 2: Offset parameter (25 results per page)
363
+ current_url = page_info()["url"]
364
+ offset = 25 # next page
365
+ goto_url(current_url + f"&offset={offset}")
366
+ wait_for_load()
367
+ wait(3)
368
+ ```
369
+
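Appending `&offset=` by string concatenation produces a duplicate parameter when the current URL already carries an `offset`, and a malformed URL when it has no query string. A safer sketch rewrites the query with `urllib.parse`; `with_offset` is a hypothetical helper, not something the package provides.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_offset(url, offset):
    """Return url with its offset parameter set to the given value."""
    parts = urlsplit(url)
    # Drop any existing offset, keep every other parameter in order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "offset"]
    query.append(("offset", str(offset)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The result can then be passed to the same navigation call used in Method 2, stepping `offset` by 25 per page.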
370
+ ### Property / Hotel Page Extraction
371
+
372
+ ```python
373
+ detail = js("""
374
+ ({
375
+ name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
376
+ || document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
377
+ rating: document.querySelector('[data-testid="rating-squares"]')
378
+ ? document.querySelectorAll('[data-testid="rating-squares"] svg').length
379
+ : null,
380
+ score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
381
+ || document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
382
+ address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
383
+ || document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
384
+ description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
385
+ || document.querySelector('#property_description_content')?.innerText?.trim(),
386
+ amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
387
+ .map(e => e.innerText?.trim()).filter(Boolean),
388
+ room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
389
+ .map(el => ({
390
+ name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
391
+ price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
392
+ })),
393
+ lat: document.querySelector('a[href*="maps.google"]')
394
+ ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
395
+ lon: document.querySelector('a[href*="maps.google"]')
396
+ ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
397
+ })
398
+ """)
399
+ ```
400
+
401
+ ### JSON-LD Schema (Property Pages)
402
+
403
+ Property pages embed JSON-LD when fully rendered in the browser. The schema type
404
+ is `Hotel` (the snippet also accepts `LodgingBusiness` as a fallback):
405
+
406
+ ```python
407
+ ld_json = js("""
408
+ (function() {
409
+ for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
410
+ try {
411
+ var d = JSON.parse(s.textContent);
412
+ if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
413
+ } catch(e) {}
414
+ }
415
+ return null;
416
+ })()
417
+ """)
418
+ # Returns:
419
+ # {
420
+ # "@type": "Hotel",
421
+ # "name": "Hotel de Crillon",
422
+ # "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
423
+ # "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
424
+ # "geo": {"latitude": 48.865, "longitude": 2.321},
425
+ # "starRating": {"ratingValue": 5}
426
+ # }
427
+ ```
428
+
429
+ JSON-LD is **not present in the WAF stub** — it only exists in the fully
430
+ rendered page. `http_get` will never see it.
431
+
432
+ ### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)
433
+
434
+ Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
435
+ legacy `b_hotel_data` globals. Access via:
436
+
437
+ ```python
438
+ next_data = js("window.__NEXT_DATA__") # dict or None
439
+ b_hotel = js("window.b_hotel_data") # dict or None — legacy pages
440
+ ```
441
+
442
+ These globals are not present in the WAF stub and their availability depends
443
+ on page version. Prefer data-testid selectors, which are more stable.
444
+
445
+ ---
446
+
447
+ ## Pricing Extraction Patterns
448
+
449
+ Booking.com shows prices per night with multiple formatting variants:
450
+
451
+ ```python
452
+ price_patterns = js("""
453
+ ({
454
+ // Search results card price
455
+ search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
456
+ // Property page room price
457
+ room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
458
+ // Original (crossed-out) price before discount
459
+ original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
460
+ || document.querySelector('.prco-valign-middle-helper del')?.innerText,
461
+ // "Price for X nights" summary
462
+ total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
463
+ // Genius discount tag
464
+ genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
465
+ })
466
+ """)
467
+ ```
468
+
469
+ **Price display nuances:**
470
+ - Prices shown are **per night** by default; multiply by nights for total.
471
+ - Currency is controlled by `selected_currency` URL param or user account
472
+ setting.
473
+ - Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
474
+ or `"+ taxes & fees"` text adjacent to the price element.
475
+ - The `data-testid="price-and-discounted-price"` element returns a single
476
+ string that may contain both original and discounted price
477
+ (e.g., `"US$400\nUS$320"`).
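
The combined price string can be split into its parts before any numeric parsing. This is a minimal sketch, assuming the last line is the current price (as the example above suggests) and that amounts use `,` as a thousands separator and `.` for decimals. `split_price_text` and `parse_amount` are hypothetical helper names, not part of the harness:

```python
import re

def split_price_text(text):
    """Split a price string into (original, current); original is None without a discount."""
    parts = [p.strip() for p in (text or "").split("\n") if p.strip()]
    if not parts:
        return None, None
    current = parts[-1]
    original = parts[0] if len(parts) > 1 else None
    return original, current

def parse_amount(price):
    """Extract the numeric amount, assuming ',' thousands separators and '.' decimals."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price or "")
    return float(match.group().replace(",", "")) if match else None

original, current = split_price_text("US$400\nUS$320")
nightly = parse_amount(current)
total_for_3_nights = nightly * 3  # per-night price multiplied by nights
```

Locales that swap the separators (for example `1.250,00`) would need a different `parse_amount`; the sketch does not attempt to detect them.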
478
+
479
+ ---
480
+
481
+ ## WAF Detection & Handling in Browser
482
+
483
+ The WAF resolves automatically in a real Chrome session. To detect whether
484
+ the page is still stuck on the challenge:
485
+
486
+ ```python
487
+ def check_booking_waf():
488
+     url = page_info()["url"]
489
+     html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
490
+     return (
491
+         "chal_t=" in url
492
+         or "AwsWafIntegration" in html_snippet
493
+         or "challenge-container" in html_snippet
494
+     )
495
+
496
+ def wait_past_waf(timeout=15):
497
+     import time
498
+     deadline = time.time() + timeout
499
+     while time.time() < deadline:
500
+         if not check_booking_waf():
501
+             return True
502
+         wait(1)
503
+     return False  # timed out: WAF didn't resolve
504
+
505
+ # Use after goto_url():
506
+ goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
507
+ wait_for_load()
508
+ wait_past_waf()
509
+ wait(2) # hydration
510
+ ```
511
+
512
+ ---
513
+
514
+ ## Sitemap-Based URL Discovery Workflow
515
+
516
+ Use this when you need a list of property URLs for a given country or city,
517
+ without needing to scrape search results pages in the browser:
518
+
519
+ ```python
520
+ import gzip, re, urllib.request
521
+
522
+ GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
523
+
524
+ def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
525
+     """Return property page URLs for a country from sitemaps. No browser needed."""
526
+     idx_url = "https://www.booking.com/sitembk-hotel-index.xml"
527
+     idx = http_get(idx_url, headers=GOOGLEBOT)
528
+     pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{re.escape(lang)}\.\d+\.xml\.gz)</loc>'
529
+     shards = re.findall(pattern, idx)[:max_shards]
530
+
531
+     urls = []
532
+     for shard_url in shards:
533
+         req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
534
+         with urllib.request.urlopen(req, timeout=60) as r:
535
+             xml = gzip.decompress(r.read()).decode()
536
+         all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
537
+         # Filter by country code
538
+         country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
539
+         urls.extend(country_urls)
540
+     return urls
541
+
542
+ # Example: get French hotel URLs (no browser needed, instant)
543
+ # french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
544
+ # len(french_hotels) -> ~8,000+ URLs from one shard
545
+ ```
546
+
547
+ ---
548
+
549
+ ## Gotchas
550
+
551
+ - **WAF blocks everything via `http_get`** — there is no User-Agent or header
552
+ combination that bypasses it. The challenge is cryptographic, not heuristic.
553
+ - **WAF has two page sizes** — ~3,962 bytes (newer SDK, no AJAX reporter) and
554
+ ~8,410 bytes (older with error reporting). Both are equally blocked.
555
+ - **Sitemaps whitelist Googlebot UA** — `Googlebot/2.1` UA works for sitemap
556
+ XML/GZ files but NOT for hotel/city/search HTML pages.
557
+ - **GraphQL endpoint is unprotected** but useless without a valid Booking.com
558
+ session (irene service requires authentication for all substantive queries).
559
+ - **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by an
560
+ operation-name restriction. Use field validation errors to probe the schema.
561
+ - **GDPR consent banner**: shown after WAF resolves, before React renders
562
+ search results. Dismiss it (click `[data-testid="accept"]`) before
563
+ interacting with the page in EU sessions. Non-EU IPs may not see it.
564
+ - **React hydration delay**: `wait_for_load()` returns before card data renders.
565
+ Always add 2-3 s of `wait()` after `wait_for_load()`.
566
+ - **`sr-hotel` class is legacy** — Booking.com migrated to data-testid
567
+ attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
568
+ - **Price parsing**: the price element often contains the full string
569
+ `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
570
+ item for current price.
571
+ - **Offset pagination cap**: Booking caps results at 1,000 properties per
572
+ search (offset 0–975, rows=25). For cities with >1,000 properties, use
573
+ filters (`nflt`) to segment results.
574
+ - **Currency must be set via URL param**: `selected_currency=USD` in the search
575
+ URL; the cookie-based currency selection may not persist across navigation.
576
+ - **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
577
+ London = `-2601889`. Negative integers indicate city-level destinations.
578
+ Get the ID by reading it from the URL after using `ss=` search.
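
The pagination cap and `dest_id` notes above can be sketched with the standard library alone. This assumes the query parameters are literally named `offset` and `dest_id`, as the gotchas suggest. `page_offsets`, `with_offset`, and `extract_dest_id` are hypothetical helper names, not part of the harness:

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def page_offsets(rows=25, cap=1000):
    """Offsets for every results page under the cap: 0, 25, ..., 975."""
    return list(range(0, cap, rows))

def with_offset(url, offset):
    """Return url with its offset query parameter set (assumed name: offset)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["offset"] = [str(offset)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

def extract_dest_id(url):
    """Read dest_id from a search URL; returns an int, or None if absent or malformed."""
    values = parse_qs(urlsplit(url).query).get("dest_id")
    try:
        return int(values[0]) if values else None
    except ValueError:
        return None

offsets = page_offsets()
search_url = "https://www.booking.com/searchresults.html?ss=Paris&dest_id=-1456928"
paged_url = with_offset(search_url, offsets[1])
```

In a browser session, the URL passed to `extract_dest_id` would come from `page_info()["url"]` after an `ss=` search has loaded.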