@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,578 +1,578 @@
- # Booking.com — Scraping & Data Extraction
-
- Field-tested against booking.com on 2026-04-18 using `http_get` and the
- `dml/graphql` JSON API. All tests run without a browser session.
-
- ---
-
- ## TL;DR
-
- **`http_get` returns nothing useful from booking.com.** Every HTML page —
- search results, hotel pages, city pages, the homepage — is intercepted by an
- AWS WAF JS challenge before any content is served. The challenge requires
- JavaScript execution to complete a cryptographic puzzle and set an
- `aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.
-
- **What you can do without a browser:**
- - Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
- - Read `robots.txt` for URL pattern documentation.
- - Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
-   exploration (no auth = internal errors, but validation errors reveal the
-   schema).
-
- **For all actual data extraction, use the browser (`goto` + `js`).**
-
- ---
-
- ## AWS WAF JS Challenge — What It Is
-
- Every `http_get` request to `www.booking.com` receives one of two variants of
- a WAF stub:
-
- **Variant A (~3,962 bytes) — modern SDK:**
- ```html
- <script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
- <script>
-   AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
- </script>
- ```
-
- **Variant B (~8,410 bytes) — with AJAX error reporting:**
- Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
- `https://reports.booking.com/chal_report`. This variant is more common on
- non-browser UA strings.
-
- **Detection in your code:**
- ```python
- def is_waf_blocked(html: str) -> bool:
-     return (
-         'AwsWafIntegration' in html
-         or 'awsWafCookieDomainList' in html
-         or 'challenge.js' in html
-         or len(html) < 10_000 and '<title></title>' in html
-     )
- ```
-
- **What the challenge does:**
- 1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
- 2. Executes a cryptographic proof-of-work puzzle client-side.
- 3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
- 4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
-    appended.
-
- This challenge **cannot be solved by `http_get`**. It requires a real JS
- engine. A `bkng` session cookie is set on the first blocked response, but it
- has no value without the WAF token.
-
- **User agents tested — all blocked:**
- - Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
- - iPhone/Safari mobile
- - `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
- - Default `urllib` UA
-
- ---
-
- ## What `http_get` CAN Access
-
- ### 1. XML Sitemaps (URL discovery)
-
- Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
- millions of property, city, region, and attraction URLs without a browser.
-
- ```python
- import gzip, re, urllib.request
-
- GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
-
- def fetch_sitemap_index(url: str) -> list[str]:
-     """Returns list of child sitemap URLs from an index sitemap."""
-     xml = http_get(url, headers=GOOGLEBOT)
-     return re.findall(r'<loc>(https://[^<]+)</loc>', xml)
-
- def fetch_sitemap_gz(gz_url: str) -> list[str]:
-     """Decompresses a gzipped sitemap and returns all <loc> URLs."""
-     req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
-     with urllib.request.urlopen(req, timeout=30) as r:
-         data = gzip.decompress(r.read())
-     return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())
-
- # Example: get all en-gb hotel URLs
- hotel_idx = http_get(
-     "https://www.booking.com/sitembk-hotel-index.xml",
-     headers=GOOGLEBOT
- )
- # 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
- en_gb_shards = re.findall(
-     r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
-     hotel_idx
- )
- # hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard
- ```
-
- **Available sitemap categories (confirmed, 275 total):**
-
- | Index URL | Content |
- |-----------|---------|
- | `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
- | `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
- | `sitembk-region-index.xml` | Region landing pages |
- | `sitembk-country-index.xml` | Country landing pages |
- | `sitembk-attractions-index.xml` | Attractions |
- | `sitembk-hotel-review-index.xml` | Review pages |
- | `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
-
- ### 2. `robots.txt`
-
- ```python
- robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
- ```
-
- - Returns immediately, no WAF.
- - 136 Disallow entries, 275 Sitemap declarations.
- - Documents all URL structures (search results, hotel pages, booking flow, etc.).
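
The Disallow and Sitemap counts quoted in the `robots.txt` notes can be reproduced by parsing the file body directly. The following is a minimal sketch, assuming the text has already been fetched (for example with the `http_get` call in the hunk). The function name `summarize_robots` and the three-line sample are hypothetical and are not part of the package.

```python
def summarize_robots(robots_txt: str) -> dict:
    """Collect Disallow paths and Sitemap URLs from a robots.txt body."""
    disallow, sitemaps = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        key, value = line.split(":", 1)  # split on the first colon only
        key, value = key.strip().lower(), value.strip()
        if key == "disallow" and value:
            disallow.append(value)
        elif key == "sitemap" and value:
            sitemaps.append(value)
    return {"disallow": disallow, "sitemaps": sitemaps}


# Hypothetical sample; the real file is much longer.
sample = """\
User-agent: *
Disallow: /searchresults.html
Sitemap: https://www.booking.com/sitembk-hotel-index.xml
"""
summary = summarize_robots(sample)
print(len(summary["disallow"]), len(summary["sitemaps"]))
```

Splitting on the first colon only keeps the `https://` in Sitemap values intact.
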
-
- ### 3. GraphQL Schema Exploration (no auth)
-
- The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
- It accepts POST requests and returns JSON. Without a session, most queries
- return `Internal Server Error` from the backend (`irene` service), but
- **GraphQL validation errors fire before the backend** and reveal the schema.
-
- ```python
- import json, urllib.request, gzip
-
- GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
- GQL_HEADERS = {
-     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
-     "Accept": "application/json",
-     "Content-Type": "application/json",
-     "Origin": "https://www.booking.com",
-     "Referer": "https://www.booking.com/searchresults.html",
-     "x-booking-context-action-name": "searchresults",
-     "x-booking-context-aid": "376510",
-     "x-booking-site-type-id": "1",
- }
-
- def gql(operation_name: str, query: str, variables: dict = None) -> dict:
-     payload = {"operationName": operation_name, "query": query}
-     if variables:
-         payload["variables"] = variables
-     req = urllib.request.Request(
-         GQL_URL,
-         data=json.dumps(payload).encode(),
-         headers=GQL_HEADERS,
-         method="POST"
-     )
-     with urllib.request.urlopen(req, timeout=20) as r:
-         data = r.read()
-         if r.headers.get("Content-Encoding") == "gzip":
-             data = gzip.decompress(data)
-     return json.loads(data.decode())
- ```
-
- **Confirmed Query type fields (schema, field-tested 2026-04-18):**
-
- | Field | Input type | Notes |
- |-------|-----------|-------|
- | `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
- | `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
- | `searchProperties` | `SearchInput!` | Returns 500 without auth session |
- | `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
- | `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |
-
- **Important:** Booking.com GraphQL uses an **operation name whitelist** for
- some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
- following confirmed working names: `SearchResultsPage`, `SearchQuery`,
- `HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.
-
- **Operation names that bypass the whitelist restriction** (all return
- `{ data: { __typename: 'Query' } }` with `{ __typename }`):
- - `SearchResultsPage` ✓ (confirmed, use this)
-
- **The search query structure** (known but returns 500 without session):
- ```graphql
- query SearchResultsPage($input: SearchQueryInput!) {
-   searchQueries {
-     search(input: $input) {
-       __typename # Returns SearchQueryResult type
-     }
-   }
- }
- ```
-
- With `SearchQueryInput` fields (inferred from URL parameters, confirmed
- accepted by validation):
- ```json
- {
-   "dest_id": "-1456928",
-   "dest_type": "CITY",
-   "checkin": "2026-05-01",
-   "checkout": "2026-05-03",
-   "group_adults": "2",
-   "no_rooms": "1",
-   "group_children": "0",
-   "selected_currency": "USD"
- }
- ```
-
- ---
-
- ## URL Parameter Reference
-
- ### Search Results
- `https://www.booking.com/searchresults.html`
-
- | Parameter | Type | Example | Notes |
- |-----------|------|---------|-------|
- | `ss` | string | `Paris` | Free-text: city, hotel name, address |
- | `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
- | `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
- | `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
- | `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
- | `group_adults` | int | `2` | |
- | `no_rooms` | int | `1` | |
- | `group_children` | int | `0` | |
- | `age` | int (repeatable) | `5` | Child age; one per child |
- | `selected_currency` | string | `USD` | ISO 4217 currency code |
- | `lang` | string | `en-us` | BCP 47 locale |
- | `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
- | `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
- | `offset` | int | `25` | Pagination (0-based, step 25) |
- | `rows` | int | `25` | Results per page (max 25) |
- | `map` | `1` | `1` | Map view mode |
- | `src` | string | `searchresults` | Source context (cosmetic) |
-
- **Common `nflt` filter codes:**
- - `ht_id=204` — Hotels only
- - `class=3;class=4;class=5` — Star rating
- - `review_score=90` — Guest rating ≥ 9.0
- - `fc=2` — Free cancellation
- - `rm_types=…` — Room type
- - `pri=1;pri=2` — Price tier (budget / mid / upscale)
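
The parameter table and `nflt` codes in this hunk combine into a single query string. The following is a minimal sketch of that assembly using only the standard library. The helper name `build_search_url` and its keyword arguments are hypothetical, and whether the site treats a percent-encoded `nflt` value the same as a raw one is an assumption that was not checked here.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.booking.com/searchresults.html"


def build_search_url(ss, checkin, checkout, adults=2, rooms=1,
                     currency="USD", filters=(), page=0):
    """Assemble a search URL from parameters listed in the reference table."""
    params = [
        ("ss", ss),
        ("checkin", checkin),
        ("checkout", checkout),
        ("group_adults", adults),
        ("no_rooms", rooms),
        ("selected_currency", currency),
    ]
    if filters:
        # nflt takes semicolon-separated key=value filters.
        params.append(("nflt", ";".join(filters)))
    if page:
        # offset is 0-based and advances in steps of 25.
        params.append(("offset", page * 25))
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url("Paris", "2026-05-01", "2026-05-03",
                       filters=["ht_id=204", "class=4"], page=1)
print(url)
```

`urlencode` percent-encodes the `;` and `=` inside the filter value, so the filters travel as one `nflt` parameter rather than being split into separate query fields.
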
-
- ### Property Pages
- `https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`
-
- Confirmed from sitemap (74 shards, ~3.5M properties):
- ```
- https://www.booking.com/hotel/{cc}/{slug}.html
- https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
- https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
- ```
- - `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
- - `slug` = hotel name, lowercase, hyphen-separated
- - Locale suffix optional; omit for default (English)
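
URLs pulled from the hotel sitemap can be split back into `cc`, `slug` and locale with a single regular expression. The following is a minimal sketch based only on the three URL shapes listed in the hunk. The character classes used for the slug and locale are assumptions, the example URLs are made up, and URLs carrying a query string are deliberately rejected.

```python
import re

# Pattern follows the three URL shapes listed in the hunk; the exact
# character classes for slug and locale are assumptions.
HOTEL_URL = re.compile(
    r"^https://www\.booking\.com/hotel/"
    r"(?P<cc>[a-z]{2})/"
    r"(?P<slug>[^/.]+)"
    r"(?:\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?))?"
    r"\.html$"
)


def parse_hotel_url(url: str):
    """Return cc, slug and optional locale, or None if the URL does not match."""
    m = HOTEL_URL.match(url)
    return m.groupdict() if m else None


# Hypothetical example URLs.
print(parse_hotel_url("https://www.booking.com/hotel/fr/example-hotel.en-gb.html"))
print(parse_hotel_url("https://www.booking.com/hotel/fr/example-hotel.html"))
```
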
-
- ### City / Region / Country Pages
- ```
- https://www.booking.com/city/{cc}/{city-slug}.html
- https://www.booking.com/region/{cc}/{region-slug}.html
- https://www.booking.com/country/{cc}.html
- ```
-
- ---
-
- ## Browser-Based Extraction (Required for All Data)
-
- Since `http_get` is blocked, all actual data extraction requires the browser
- (`goto` + `js`). The WAF challenge resolves automatically in Chrome.
-
- ### Initial Navigation
-
- ```python
- # Always use new_tab() for the first Booking.com load in a session
- tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
- wait_for_load()
- wait(3) # React hydration takes ~3s after readyState=complete
-
- # Check for WAF challenge still running (rare in real Chrome)
- url = page_info()["url"]
- if "chal_t=" in url:
-     wait(5) # WAF challenge resolving
-     wait_for_load()
- ```
-
- ### GDPR / Cookie Consent Banner (EU Visitors)
-
- Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
- the WAF challenge resolves. It blocks interaction until dismissed.
-
- ```python
- def dismiss_cookie_banner():
-     # Booking.com uses data-testid="accept" on the Accept button
-     accepted = js("""
-     (function() {
-       var btn = document.querySelector('[data-testid="accept"]')
-         || document.querySelector('#onetrust-accept-btn-handler')
-         || document.querySelector('[aria-label*="Accept"]');
-       if (btn) { btn.click(); return true; }
-       return false;
-     })()
-     """)
-     return accepted
-
- # Call immediately after load if you have an EU IP
- if dismiss_cookie_banner():
-     wait(1)
- ```
-
- The consent banner does **not** appear in the WAF stub — it only renders after
- the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
- may not see it at all.
-
- ### Search Results Page Extraction
-
- ```python
- results = js("""
- Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
-   name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
-   url: el.querySelector('[data-testid="title-link"]')?.href,
-   price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
-   rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
-   stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
-   location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
-   availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
-   is_genius: !!el.querySelector('[data-testid="genius-label"]'),
- }))
- """)
- ```
-
- **Field notes:**
- - `data-testid="property-card"` — confirmed selector for result cards (as of
-   2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
- - `data-testid="price-and-discounted-price"` — contains the nightly rate;
-   may show original + discounted price together as text.
- - `data-testid="review-score"` — contains both the numeric score (e.g.,
-   `"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
- - `data-testid="rating-stars"` — star rating icons; count SVG children for
-   star count.
- - Results are loaded asynchronously; 3s wait after `wait_for_load()` is
-   required for all cards to render.
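
The `review-score` field note suggests taking the first line of the text as the score. The following is a minimal Python-side sketch of that split, applied to the `rating` string returned by the extraction snippet. The helper name `parse_review_score` is hypothetical, and it assumes the first non-empty line is a bare number; accepting a decimal comma is a further assumption about localized pages.

```python
def parse_review_score(text):
    """Split review-score text into a numeric score and a label."""
    if not text:
        return None, None
    lines = [part.strip() for part in text.split("\n") if part.strip()]
    if not lines:
        return None, None
    try:
        # Assumption: localized pages may use a decimal comma.
        score = float(lines[0].replace(",", "."))
    except ValueError:
        score = None  # first line was not a plain number
    label = lines[1] if len(lines) > 1 else None
    return score, label


print(parse_review_score("9.2\nSuperb"))
```

If the first line turns out to carry extra words, the function returns `None` for the score instead of raising, so a batch over many cards is not interrupted.
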
-
- ### Pagination
-
- ```python
- # Method 1: Next page button
- next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
- if next_btn:
-     goto_url(next_btn)
-     wait_for_load()
-     wait(3)
-
- # Method 2: Offset parameter (25 results per page)
- current_url = page_info()["url"]
- offset = 25 # next page
- goto_url(current_url + f"&offset={offset}")
- wait_for_load()
- wait(3)
- ```
369
-
370
- ### Property / Hotel Page Extraction
371
-
372
- ```python
373
- detail = js("""
374
- ({
375
- name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
376
- || document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
377
- rating: document.querySelector('[data-testid="rating-squares"]')
378
- ? document.querySelectorAll('[data-testid="rating-squares"] svg').length
379
- : null,
380
- score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
381
- || document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
382
- address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
383
- || document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
384
- description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
385
- || document.querySelector('#property_description_content')?.innerText?.trim(),
386
- amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
387
- .map(e => e.innerText?.trim()).filter(Boolean),
388
- room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
389
- .map(el => ({
390
- name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
391
- price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
392
- })),
393
- lat: document.querySelector('a[href*="maps.google"]')
394
- ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
395
- lon: document.querySelector('a[href*="maps.google"]')
396
- ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
397
- })
398
- """)
399
- ```
400
-
401
- ### JSON-LD Schema (Property Pages)
402
-
403
- Property pages embed JSON-LD when fully rendered in browser. The schema type
404
- is `Hotel`:
405
-
406
- ```python
407
- ld_json = js("""
408
- (function() {
409
- for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
410
- try {
411
- var d = JSON.parse(s.textContent);
412
- if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
413
- } catch(e) {}
414
- }
415
- return null;
416
- })()
417
- """)
418
- # Returns:
419
- # {
420
- # "@type": "Hotel",
421
- # "name": "Hotel de Crillon",
422
- # "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
423
- # "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
424
- # "geo": {"latitude": 48.865, "longitude": 2.321},
425
- # "starRating": {"ratingValue": 5}
426
- # }
427
- ```
428
-
429
- JSON-LD is **not present in the WAF stub** — it only exists in the fully
430
- rendered page. `http_get` will never see it.
431
-
432
- ### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)
433
-
434
- Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
435
- legacy `b_hotel_data` globals. Access via:
436
-
437
- ```python
438
- next_data = js("window.__NEXT_DATA__") # dict or None
439
- b_hotel = js("window.b_hotel_data") # dict or None — legacy pages
440
- ```
441
-
442
- These globals are not present in the WAF stub and their availability depends
443
- on page version. Prefer data-testid selectors which are more stable.
444
-
445
- ---
446
-
447
- ## Pricing Extraction Patterns
448
-
449
- Booking.com shows prices per night with multiple formatting variants:
450
-
451
- ```python
452
- price_patterns = js("""
453
- ({
454
- // Search results card price
455
- search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
456
- // Property page room price
457
- room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
458
- // Original (crossed-out) price before discount
459
- original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
460
- || document.querySelector('.prco-valign-middle-helper del')?.innerText,
461
- // "Price for X nights" summary
462
- total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
463
- // Genius discount tag
464
- genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
465
- })
466
- """)
467
- ```
468
-
469
- **Price display nuances:**
470
- - Prices shown are **per night** by default; multiply by nights for total.
471
- - Currency is controlled by `selected_currency` URL param or user account
472
- setting.
473
- - Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
474
- or `"+ taxes & fees"` text adjacent to the price element.
475
- - The `data-testid="price-and-discounted-price"` element returns a single
476
- string that may contain both original and discounted price
477
- (e.g., `"US$400\nUS$320"`).
478
-
479
- ---
480
-
481
- ## WAF Detection & Handling in Browser
482
-
483
- The WAF resolves automatically in a real Chrome session. To detect if
484
- something went wrong:
485
-
486
- ```python
487
- def check_booking_waf():
488
- url = page_info()["url"]
489
- html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
490
- return (
491
- "chal_t=" in url
492
- or "AwsWafIntegration" in html_snippet
493
- or "challenge-container" in html_snippet
494
- )
495
-
496
- def wait_past_waf(timeout=15):
497
- import time
498
- deadline = time.time() + timeout
499
- while time.time() < deadline:
500
- if not check_booking_waf():
501
- return True
502
- wait(1)
503
- return False # timed out — WAF didn't resolve
504
-
505
- # Use after goto_url():
506
- goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
507
- wait_for_load()
508
- wait_past_waf()
509
- wait(2) # hydration
510
- ```
511
-
512
- ---
513
-
514
- ## Sitemap-Based URL Discovery Workflow
515
-
516
- Use this when you need a list of property URLs for a given country or city,
517
- without needing to scrape search results pages in the browser:
518
-
519
- ```python
520
- import gzip, re, urllib.request
521
-
522
- GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
523
-
524
- def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
525
- """Returns property page URLs for a country from sitemaps. No browser needed."""
526
- idx_url = f"https://www.booking.com/sitembk-hotel-index.xml"
527
- idx = http_get(idx_url, headers=GOOGLEBOT)
528
- pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{lang}\.\d+\.xml\.gz)</loc>'
529
- shards = re.findall(pattern, idx)[:max_shards]
530
-
531
- urls = []
532
- for shard_url in shards:
533
- req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
534
- with urllib.request.urlopen(req, timeout=60) as r:
535
- xml = gzip.decompress(r.read()).decode()
536
- all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
537
- # Filter by country code
538
- country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
539
- urls.extend(country_urls)
540
- return urls
541
-
542
- # Example: get French hotel URLs (no browser needed, instant)
543
- # french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
544
- # len(french_hotels) -> ~8,000+ URLs from one shard
545
- ```
546
-
547
- ---
548
-
549
- ## Gotchas
550
-
551
- - **WAF blocks everything via `http_get`** — there is no User-Agent or header
552
- combination that bypasses it. The challenge is cryptographic, not heuristic.
553
- - **WAF has two page sizes** — ~3,962 bytes (newer SDK, no AJAX reporter) and
554
- ~8,410 bytes (older with error reporting). Both are equally blocked.
555
- - **Sitemaps whitelist Googlebot UA** — `Googlebot/2.1` UA works for sitemap
556
- XML/GZ files but NOT for hotel/city/search HTML pages.
557
- - **GraphQL endpoint is unprotected** but useless without a valid Booking.com
558
- session (irene service requires authentication for all substantive queries).
559
- - **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by
560
- operation name restriction. Use field validation errors to probe the schema.
561
- - **GDPR consent banner**: shown after WAF resolves, before React renders
562
- search results. Must be dismissed (click `[data-testid="accept"]`) before
563
- interacting with EU sessions. Non-EU IPs may not see it.
564
- - **React hydration delay**: `wait_for_load()` fires before card data renders.
565
- Always add 2-3s of `wait()` after `wait_for_load()`.
566
- - **`sr-hotel` class is legacy** — Booking.com migrated to data-testid
567
- attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
568
- - **Price parsing**: the price element often contains the full string
569
- `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
570
- item for current price.
571
- - **Offset pagination cap**: Booking caps results at 1,000 properties per
572
- search (offset 0–975, rows=25). For cities with >1,000 properties, use
573
- filters (`nflt`) to segment results.
574
- - **Currency must be set via URL param**: `selected_currency=USD` in the search
575
- URL; the cookie-based currency selection may not persist across navigation.
576
- - **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
577
- London = `-2601889`. Negative integers indicate city-level destinations.
578
- Get the ID by reading it from the URL after using `ss=` search.
1
+ # Booking.com — Scraping & Data Extraction
2
+
3
+ Field-tested against booking.com on 2026-04-18 using `http_get` and the
4
+ `dml/graphql` JSON API. All tests run without a browser session.
5
+
6
+ ---
7
+
8
+ ## TL;DR
9
+
10
+ **`http_get` returns nothing useful from booking.com.** Every HTML page —
11
+ search results, hotel pages, city pages, the homepage — is intercepted by an
12
+ AWS WAF JS challenge before any content is served. The challenge requires
13
+ JavaScript execution to complete a cryptographic puzzle and set an
14
+ `aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.
15
+
16
+ **What you can do without a browser:**
17
+ - Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
18
+ - Read `robots.txt` for URL pattern documentation.
19
+ - Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
20
+ exploration (no auth = internal errors, but validation errors reveal the
21
+ schema).
22
+
23
+ **For all actual data extraction, use the browser (`goto` + `js`).**
24
+
25
+ ---
26
+
27
+ ## AWS WAF JS Challenge — What It Is
28
+
29
+ Every `http_get` request to `www.booking.com` receives one of two variants of
30
+ a WAF stub:
31
+
32
+ **Variant A (~3,962 bytes) — modern SDK:**
33
+ ```html
34
+ <script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
35
+ <script>
36
+ AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
37
+ </script>
38
+ ```
39
+
40
+ **Variant B (~8,410 bytes) — with AJAX error reporting:**
41
+ Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
42
+ `https://reports.booking.com/chal_report`. This variant is more common on
43
+ non-browser UA strings.
44
+
45
+ **Detection in your code:**
46
+ ```python
47
+ def is_waf_blocked(html: str) -> bool:
+     return (
+         'AwsWafIntegration' in html
+         or 'awsWafCookieDomainList' in html
+         or 'challenge.js' in html
+         or (len(html) < 10_000 and '<title></title>' in html)
+     )
54
+ ```
55
+
56
+ **What the challenge does:**
57
+ 1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
58
+ 2. Executes a cryptographic proof-of-work puzzle client-side.
59
+ 3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
60
+ 4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
61
+ appended.
62
+
63
+ This challenge **cannot be solved by `http_get`**. It requires a real JS
64
+ engine. A `bkng` session cookie is set on the first blocked response, but it
65
+ has no value without the WAF token.
66
+
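A small sketch for normalising URLs after the challenge redirect described in step 4: it drops the `chal_t` and `force_referer` query parameters so the final URL can be compared with the one originally requested. `strip_challenge_params` and `CHALLENGE_PARAMS` are illustrative names, not part of the agent helpers, and the example timestamp is made up.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters the WAF redirect appends (names taken from step 4 above).
CHALLENGE_PARAMS = {"chal_t", "force_referer"}

def strip_challenge_params(url: str) -> str:
    """Return url without the challenge parameters, keeping the others in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in CHALLENGE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical post-redirect URL; the chal_t value is a placeholder.
cleaned = strip_challenge_params(
    "https://www.booking.com/searchresults.html?ss=Paris&chal_t=1713400000&force_referer="
)
```

This only tidies the URL string. It does not solve or skip the challenge, which still needs a real JS engine.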
67
+ **User agents tested — all blocked:**
68
+ - Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
69
+ - iPhone/Safari mobile
70
+ - `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
71
+ - Default `urllib` UA
72
+
73
+ ---
74
+
75
+ ## What `http_get` CAN Access
76
+
77
+ ### 1. XML Sitemaps (URL discovery)
78
+
79
+ Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
80
+ millions of property, city, region, and attraction URLs without a browser.
81
+
82
+ ```python
83
+ import gzip, re, urllib.request
84
+
85
+ GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
86
+
87
+ def fetch_sitemap_index(url: str) -> list[str]:
+     """Returns list of child sitemap URLs from an index sitemap."""
+     xml = http_get(url, headers=GOOGLEBOT)
+     return re.findall(r'<loc>(https://[^<]+)</loc>', xml)
+
+ def fetch_sitemap_gz(gz_url: str) -> list[str]:
+     """Decompresses a gzipped sitemap and returns all <loc> URLs."""
+     req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
+     with urllib.request.urlopen(req, timeout=30) as r:
+         data = gzip.decompress(r.read())
+     return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())
98
+
99
+ # Example: get all en-gb hotel URLs
100
+ hotel_idx = http_get(
101
+ "https://www.booking.com/sitembk-hotel-index.xml",
102
+ headers=GOOGLEBOT
103
+ )
104
+ # 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
105
+ en_gb_shards = re.findall(
106
+ r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
107
+ hotel_idx
108
+ )
109
+ # hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard
110
+ ```
111
+
112
+ **Available sitemap categories (confirmed, 275 total):**
113
+
114
+ | Index URL | Content |
115
+ |-----------|---------|
116
+ | `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
117
+ | `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
118
+ | `sitembk-region-index.xml` | Region landing pages |
119
+ | `sitembk-country-index.xml` | Country landing pages |
120
+ | `sitembk-attractions-index.xml` | Attractions |
121
+ | `sitembk-hotel-review-index.xml` | Review pages |
122
+ | `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
123
+
124
+ ### 2. `robots.txt`
125
+
126
+ ```python
127
+ robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
128
+ ```
129
+
130
+ - Returns immediately, no WAF.
131
+ - 136 Disallow entries, 275 Sitemap declarations.
132
+ - Documents all URL structures (search results, hotel pages, booking flow, etc.).
133
+
134
+ ### 3. GraphQL Schema Exploration (no auth)
135
+
136
+ The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
137
+ It accepts POST requests and returns JSON. Without a session, most queries
138
+ return `Internal Server Error` from the backend (`irene` service), but
139
+ **GraphQL validation errors fire before the backend** and reveal the schema.
140
+
141
+ ```python
142
+ import json, urllib.request, gzip
143
+
144
+ GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
145
+ GQL_HEADERS = {
146
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
147
+ "Accept": "application/json",
148
+ "Content-Type": "application/json",
149
+ "Origin": "https://www.booking.com",
150
+ "Referer": "https://www.booking.com/searchresults.html",
151
+ "x-booking-context-action-name": "searchresults",
152
+ "x-booking-context-aid": "376510",
153
+ "x-booking-site-type-id": "1",
154
+ }
155
+
156
+ def gql(operation_name: str, query: str, variables: dict = None) -> dict:
+     payload = {"operationName": operation_name, "query": query}
+     if variables:
+         payload["variables"] = variables
+     req = urllib.request.Request(
+         GQL_URL,
+         data=json.dumps(payload).encode(),
+         headers=GQL_HEADERS,
+         method="POST"
+     )
+     # Note: urlopen raises urllib.error.HTTPError on 4xx/5xx responses
+     with urllib.request.urlopen(req, timeout=20) as r:
+         data = r.read()
+         if r.headers.get("Content-Encoding") == "gzip":
+             data = gzip.decompress(data)
+     return json.loads(data.decode())
171
+ ```
172
+
173
+ **Confirmed Query type fields (schema, field-tested 2026-04-18):**
174
+
175
+ | Field | Input type | Notes |
176
+ |-------|-----------|-------|
177
+ | `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
178
+ | `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
179
+ | `searchProperties` | `SearchInput!` | Returns 500 without auth session |
180
+ | `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
181
+ | `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |
182
+
183
+ **Important:** Booking.com GraphQL uses an **operation name whitelist** for
184
+ some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
185
+ following confirmed working names: `SearchResultsPage`, `SearchQuery`,
186
+ `HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.
187
+
188
+ **Operation names that bypass the whitelist restriction** (all return
189
+ `{ data: { __typename: 'Query' } }` for the query `{ __typename }`):
190
+ - `SearchResultsPage` ✓ (confirmed, use this)
191
+
192
+ **The search query structure** (known but returns 500 without session):
193
+ ```graphql
194
+ query SearchResultsPage($input: SearchQueryInput!) {
195
+ searchQueries {
196
+ search(input: $input) {
197
+ __typename # Returns SearchQueryResult type
198
+ }
199
+ }
200
+ }
201
+ ```
202
+
203
+ With `SearchQueryInput` fields (inferred from URL parameters, confirmed
204
+ accepted by validation):
205
+ ```json
206
+ {
207
+ "dest_id": "-1456928",
208
+ "dest_type": "CITY",
209
+ "checkin": "2026-05-01",
210
+ "checkout": "2026-05-03",
211
+ "group_adults": "2",
212
+ "no_rooms": "1",
213
+ "group_children": "0",
214
+ "selected_currency": "USD"
215
+ }
216
+ ```
217
+
218
+ ---
219
+
220
+ ## URL Parameter Reference
221
+
222
+ ### Search Results
223
+ `https://www.booking.com/searchresults.html`
224
+
225
+ | Parameter | Type | Example | Notes |
226
+ |-----------|------|---------|-------|
227
+ | `ss` | string | `Paris` | Free-text: city, hotel name, address |
228
+ | `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
229
+ | `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
230
+ | `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
231
+ | `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
232
+ | `group_adults` | int | `2` | |
233
+ | `no_rooms` | int | `1` | |
234
+ | `group_children` | int | `0` | |
235
+ | `age` | int (repeatable) | `5` | Child age; one per child |
236
+ | `selected_currency` | string | `USD` | ISO 4217 currency code |
237
+ | `lang` | string | `en-us` | BCP 47 locale |
238
+ | `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
239
+ | `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
240
+ | `offset` | int | `25` | Pagination (0-based, step 25) |
241
+ | `rows` | int | `25` | Results per page (max 25) |
242
+ | `map` | `1` | `1` | Map view mode |
243
+ | `src` | string | `searchresults` | Source context (cosmetic) |
244
+
245
+ **Common `nflt` filter codes:**
246
+ - `ht_id=204` — Hotels only
247
+ - `class=3;class=4;class=5` — Star rating
248
+ - `review_score=90` — Guest rating ≥ 9.0
249
+ - `fc=2` — Free cancellation
250
+ - `rm_types=…` — Room type
251
+ - `pri=1;pri=2` — Price tier (budget / mid / upscale)
252
+
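As a sketch of how the parameters in the table combine, the helper below assembles a search URL with `urllib.parse.urlencode`. `build_search_url` and its keyword names are hypothetical conveniences. The parameter names come from the table above. Note that `urlencode` percent-encodes the `=` and `;` inside `nflt`, which is standard query-string encoding, though it has not been verified here that Booking.com treats the encoded and raw forms identically.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_BASE = "https://www.booking.com/searchresults.html"

def build_search_url(ss, checkin, checkout, adults=2, rooms=1, children=0,
                     child_ages=(), currency=None, filters=(), order=None, offset=0):
    """Assemble a searchresults.html URL from the documented parameters."""
    params = [
        ("ss", ss),
        ("checkin", checkin),
        ("checkout", checkout),
        ("group_adults", adults),
        ("no_rooms", rooms),
        ("group_children", children),
    ]
    # One repeatable `age` parameter per child.
    params += [("age", age) for age in child_ages]
    if currency:
        params.append(("selected_currency", currency))
    if filters:
        # nflt is a single semicolon-separated string.
        params.append(("nflt", ";".join(filters)))
    if order:
        params.append(("order", order))
    if offset:
        params.append(("offset", offset))
    return SEARCH_BASE + "?" + urlencode(params)

search_url = build_search_url(
    "Paris", "2026-05-01", "2026-05-03",
    currency="USD", filters=["ht_id=204", "class=4"],
)
# Round-trip the query string to check what a server would decode.
decoded = parse_qs(urlsplit(search_url).query)
```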
253
+ ### Property Pages
254
+ `https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`
255
+
256
+ Confirmed from sitemap (74 shards, ~3.5M properties):
257
+ ```
258
+ https://www.booking.com/hotel/{cc}/{slug}.html
259
+ https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
260
+ https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
261
+ ```
262
+ - `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
263
+ - `slug` = hotel name, lowercase, hyphen-separated
264
+ - Locale suffix optional; omit for default (English)
265
+
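The URL shapes above can be taken apart with a regular expression, which is handy when grouping sitemap output by country or locale. This is a minimal sketch: `parse_property_url` is a hypothetical helper, the slug in the example is illustrative, and the pattern assumes slugs contain no dots and that locale suffixes look like `en-gb` or `fr`.

```python
import re

# cc: two-letter country code; slug: hyphenated hotel name; lang: optional locale suffix.
PROPERTY_URL = re.compile(
    r"^https://www\.booking\.com/hotel/"
    r"(?P<cc>[a-z]{2})/"
    r"(?P<slug>[^/.]+)"
    r"(?:\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?))?"
    r"\.html"
)

def parse_property_url(url):
    """Return {'cc', 'slug', 'lang'} for a property URL, or None if it does not match."""
    match = PROPERTY_URL.match(url)
    return match.groupdict() if match else None

with_locale = parse_property_url("https://www.booking.com/hotel/fr/hotel-de-crillon.en-gb.html")
without_locale = parse_property_url("https://www.booking.com/hotel/fr/hotel-de-crillon.html")
```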
266
+ ### City / Region / Country Pages
267
+ ```
268
+ https://www.booking.com/city/{cc}/{city-slug}.html
269
+ https://www.booking.com/region/{cc}/{region-slug}.html
270
+ https://www.booking.com/country/{cc}.html
271
+ ```
272
+
273
+ ---
274
+
275
+ ## Browser-Based Extraction (Required for All Data)
276
+
277
+ Since `http_get` is blocked, all actual data extraction requires the browser
278
+ (`goto` + `js`). The WAF challenge resolves automatically in Chrome.
279
+
280
+ ### Initial Navigation
281
+
282
+ ```python
283
+ # Always use new_tab() for the first Booking.com load in a session
284
+ tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
285
+ wait_for_load()
286
+ wait(3) # React hydration takes ~3s after readyState=complete
287
+
288
+ # Check for WAF challenge still running (rare in real Chrome)
289
+ url = page_info()["url"]
290
+ if "chal_t=" in url:
291
+ wait(5) # WAF challenge resolving
292
+ wait_for_load()
293
+ ```
294
+
295
+ ### GDPR / Cookie Consent Banner (EU Visitors)
296
+
297
+ Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
298
+ the WAF challenge resolves. It blocks interaction until dismissed.
299
+
300
+ ```python
301
+ def dismiss_cookie_banner():
+     # Booking.com uses data-testid="accept" on the Accept button
+     accepted = js("""
+     (function() {
+         var btn = document.querySelector('[data-testid="accept"]')
+             || document.querySelector('#onetrust-accept-btn-handler')
+             || document.querySelector('[aria-label*="Accept"]');
+         if (btn) { btn.click(); return true; }
+         return false;
+     })()
+     """)
+     return accepted
+
+ # Call immediately after load if you have an EU IP
+ if dismiss_cookie_banner():
+     wait(1)
317
+ ```
318
+
319
+ The consent banner does **not** appear in the WAF stub — it only renders after
320
+ the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
321
+ may not see it at all.
322
+
323
+ ### Search Results Page Extraction
324
+
325
+ ```python
326
+ results = js("""
327
+ Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
328
+ name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
329
+ url: el.querySelector('[data-testid="title-link"]')?.href,
330
+ price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
331
+ rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
332
+ stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
333
+ location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
334
+ availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
335
+ is_genius: !!el.querySelector('[data-testid="genius-label"]'),
336
+ }))
337
+ """)
338
+ ```
339
+
340
+ **Field notes:**
341
+ - `data-testid="property-card"` — confirmed selector for result cards (as of
342
+ 2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
343
+ - `data-testid="price-and-discounted-price"` — contains the nightly rate;
344
+ may show original + discounted price together as text.
345
+ - `data-testid="review-score"` — contains both the numeric score (e.g.,
346
+ `"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
347
+ - `data-testid="rating-stars"` — star rating icons; count SVG children for
348
+ star count.
349
+ - Results are loaded asynchronously; 3s wait after `wait_for_load()` is
350
+ required for all cards to render.
351
+
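The `review-score` note can be sketched on the Python side as follows. `parse_review_score` is a hypothetical helper. It takes the first line of the element text, as suggested above, and pulls the first number out of it. Accepting a comma as the decimal separator is an assumption about non-English locales, not something confirmed in the field test.

```python
import re

_NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def parse_review_score(raw):
    """Return the numeric score from review-score text, or None if no number is found."""
    if not raw:
        return None
    first_line = raw.split("\n")[0]
    match = _NUMBER.search(first_line)
    if not match:
        return None
    # Assumption: a comma here is a decimal separator, never a thousands separator.
    return float(match.group().replace(",", "."))

score = parse_review_score("9.2\nSuperb")
```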
352
+ ### Pagination
353
+
354
+ ```python
355
+ # Method 1: Next page button
356
+ next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
357
+ if next_btn:
358
+ goto_url(next_btn)
359
+ wait_for_load()
360
+ wait(3)
361
+
362
+ # Method 2: Offset parameter (25 results per page)
363
+ current_url = page_info()["url"]
364
+ offset = 25 # next page
365
+ goto_url(current_url + f"&offset={offset}")
366
+ wait_for_load()
367
+ wait(3)
368
+ ```
369
+
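Method 2 appends `&offset=` to whatever the current URL is, which yields a duplicate parameter if the URL already carries an `offset`. The sketch below replaces the value instead and enumerates the offsets available under the 1,000-result cap noted in the Gotchas section. `with_offset` and `page_offsets` are illustrative names, not existing helpers.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

PAGE_SIZE = 25       # results per page, per the parameter table
MAX_RESULTS = 1000   # per-search cap described under Gotchas

def with_offset(url: str, offset: int) -> str:
    """Return url with its offset parameter set to `offset`, replacing any existing one."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "offset"
    ]
    params.append(("offset", str(offset)))
    return urlunsplit(parts._replace(query=urlencode(params)))

def page_offsets(max_results: int = MAX_RESULTS, page_size: int = PAGE_SIZE) -> list:
    """Offsets 0, 25, ... up to the last page under the cap."""
    return list(range(0, max_results, page_size))

next_url = with_offset("https://www.booking.com/searchresults.html?ss=Paris&offset=25", 50)
offsets = page_offsets()
```

In a browser session, each URL from `with_offset` would still be followed by `wait_for_load()` and the hydration wait shown above.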
370
+ ### Property / Hotel Page Extraction
371
+
372
+ ```python
373
+ detail = js("""
374
+ ({
375
+ name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
376
+ || document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
377
+ rating: document.querySelector('[data-testid="rating-squares"]')
378
+ ? document.querySelectorAll('[data-testid="rating-squares"] svg').length
379
+ : null,
380
+ score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
381
+ || document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
382
+ address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
383
+ || document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
384
+ description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
385
+ || document.querySelector('#property_description_content')?.innerText?.trim(),
386
+ amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
387
+ .map(e => e.innerText?.trim()).filter(Boolean),
388
+ room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
389
+ .map(el => ({
390
+ name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
391
+ price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
392
+ })),
393
+ lat: document.querySelector('a[href*="maps.google"]')
394
+ ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
395
+ lon: document.querySelector('a[href*="maps.google"]')
396
+ ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
397
+ })
398
+ """)
399
+ ```
400
+
401
+ ### JSON-LD Schema (Property Pages)
402
+
403
+ Property pages embed JSON-LD when fully rendered in browser. The schema type
404
+ is `Hotel`; the snippet also accepts `LodgingBusiness` as a fallback:
405
+
406
+ ```python
407
+ ld_json = js("""
408
+ (function() {
409
+ for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
410
+ try {
411
+ var d = JSON.parse(s.textContent);
412
+ if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
413
+ } catch(e) {}
414
+ }
415
+ return null;
416
+ })()
417
+ """)
418
+ # Returns:
419
+ # {
420
+ # "@type": "Hotel",
421
+ # "name": "Hotel de Crillon",
422
+ # "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
423
+ # "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
424
+ # "geo": {"latitude": 48.865, "longitude": 2.321},
425
+ # "starRating": {"ratingValue": 5}
426
+ # }
427
+ ```
428
+
429
+ JSON-LD is **not present in the WAF stub** — it only exists in the fully
430
+ rendered page. `http_get` will never see it.
431
+
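The JSON-LD values arrive with mixed types (the sample above has string ratings and numeric coordinates), so a small normalising step can help before storing them. This is a sketch under the assumption that the object has the shape shown in the sample. `summarize_hotel_ld` is a hypothetical helper and missing keys simply become `None`.

```python
def _to_float(value):
    """Convert to float, returning None for missing or non-numeric values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def _to_int(value):
    """Convert to int, returning None for missing or non-numeric values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def summarize_hotel_ld(ld):
    """Flatten a Hotel JSON-LD dict into a flat record with numeric fields."""
    if not ld:
        return None
    rating = ld.get("aggregateRating") or {}
    address = ld.get("address") or {}
    geo = ld.get("geo") or {}
    stars = ld.get("starRating") or {}
    return {
        "name": ld.get("name"),
        "score": _to_float(rating.get("ratingValue")),
        "review_count": _to_int(rating.get("reviewCount")),
        "street": address.get("streetAddress"),
        "city": address.get("addressLocality"),
        "lat": _to_float(geo.get("latitude")),
        "lon": _to_float(geo.get("longitude")),
        "stars": _to_float(stars.get("ratingValue")),
    }

# Sample values copied from the "Returns" comment above.
record = summarize_hotel_ld({
    "@type": "Hotel",
    "name": "Hotel de Crillon",
    "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
    "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris"},
    "geo": {"latitude": 48.865, "longitude": 2.321},
    "starRating": {"ratingValue": 5},
})
```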
432
+ ### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)
433
+
434
+ Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
435
+ legacy `b_hotel_data` globals. Access via:
436
+
437
+ ```python
438
+ next_data = js("window.__NEXT_DATA__") # dict or None
439
+ b_hotel = js("window.b_hotel_data") # dict or None — legacy pages
440
+ ```
441
+
442
+ These globals are not present in the WAF stub and their availability depends
443
+ on page version. Prefer data-testid selectors which are more stable.
444
+
445
+ ---
446
+
447
+ ## Pricing Extraction Patterns
448
+
449
+ Booking.com shows prices per night with multiple formatting variants:
450
+
451
+ ```python
452
+ price_patterns = js("""
453
+ ({
454
+ // Search results card price
455
+ search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
456
+ // Property page room price
457
+ room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
458
+ // Original (crossed-out) price before discount
459
+ original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
460
+ || document.querySelector('.prco-valign-middle-helper del')?.innerText,
461
+ // "Price for X nights" summary
462
+ total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
463
+ // Genius discount tag
464
+ genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
465
+ })
466
+ """)
467
+ ```
468
+
469
+ **Price display nuances:**
470
+ - Prices shown are **per night** by default; multiply by nights for total.
471
+ - Currency is controlled by `selected_currency` URL param or user account
472
+ setting.
473
+ - Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
474
+ or `"+ taxes & fees"` text adjacent to the price element.
475
+ - The `data-testid="price-and-discounted-price"` element returns a single
476
+ string that may contain both original and discounted price
477
+ (e.g., `"US$400\nUS$320"`).
478
+
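A minimal sketch of handling the combined price string on the Python side: split on newlines, treat the last line as the current price and the first as the original when a discount is present. `split_price` and `amount` are hypothetical helpers. The amount parser assumes `,` is a thousands separator and `.` the decimal point, which matches the `US$` examples here but not every currency format.

```python
import re
from decimal import Decimal

_AMOUNT = re.compile(r"\d[\d,]*(?:\.\d+)?")

def split_price(raw):
    """Return (original, current) price strings; original is None when there is no discount."""
    parts = [line.strip() for line in (raw or "").split("\n") if line.strip()]
    if not parts:
        return None, None
    original = parts[0] if len(parts) > 1 else None
    return original, parts[-1]

def amount(text):
    """Extract the numeric amount from a price string, or None if there is none."""
    match = _AMOUNT.search(text or "")
    if not match:
        return None
    # Assumption: commas are thousands separators.
    return Decimal(match.group().replace(",", ""))

original, current = split_price("US$400\nUS$320")
current_amount = amount(current)
```

`Decimal` is used instead of `float` so that amounts survive arithmetic such as multiplying by the number of nights without rounding drift.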
479
+ ---
480
+
481
+ ## WAF Detection & Handling in Browser
482
+
483
+ The WAF resolves automatically in a real Chrome session. To detect if
484
+ something went wrong:
485
+
486
+ ```python
487
+ def check_booking_waf():
+     url = page_info()["url"]
+     html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
+     return (
+         "chal_t=" in url
+         or "AwsWafIntegration" in html_snippet
+         or "challenge-container" in html_snippet
+     )
+
+ def wait_past_waf(timeout=15):
+     import time
+     deadline = time.time() + timeout
+     while time.time() < deadline:
+         if not check_booking_waf():
+             return True
+         wait(1)
+     return False  # timed out: WAF didn't resolve
504
+
505
+ # Use after goto_url():
506
+ goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
507
+ wait_for_load()
508
+ wait_past_waf()
509
+ wait(2) # hydration
510
+ ```
511
+
512
+ ---
513
+
514
+ ## Sitemap-Based URL Discovery Workflow
515
+
516
+ Use this when you need a list of property URLs for a given country or city,
517
+ without needing to scrape search results pages in the browser:
518
+
519
+ ```python
520
+ import gzip, re, urllib.request
521
+
522
+ GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
523
+
524
+ def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
+     """Returns property page URLs for a country from sitemaps. No browser needed."""
+     idx_url = "https://www.booking.com/sitembk-hotel-index.xml"
+     idx = http_get(idx_url, headers=GOOGLEBOT)
+     # Escape lang so it is matched literally inside the regex
+     pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{re.escape(lang)}\.\d+\.xml\.gz)</loc>'
+     shards = re.findall(pattern, idx)[:max_shards]
+
+     urls = []
+     for shard_url in shards:
+         req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
+         with urllib.request.urlopen(req, timeout=60) as r:
+             xml = gzip.decompress(r.read()).decode()
+         all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
+         # Filter by country code
+         country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
+         urls.extend(country_urls)
+     return urls
541
+
542
+ # Example: get French hotel URLs (no browser needed; downloads one shard)
543
+ # french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
544
+ # len(french_hotels) -> ~8,000+ URLs from one shard
545
+ ```
+
+ ---
+
+ ## Gotchas
+
+ - **WAF blocks every HTML page fetched via `http_get`**: there is no User-Agent
+   or header combination that bypasses it. The challenge is cryptographic, not
+   heuristic.
+ - **WAF has two page sizes**: ~3,962 bytes (newer SDK, no AJAX reporter) and
+   ~8,410 bytes (older, with error reporting). Both are equally blocked.
+ - **Sitemaps whitelist Googlebot UA**: the `Googlebot/2.1` UA works for sitemap
+   XML/GZ files but NOT for hotel/city/search HTML pages.
+ - **GraphQL endpoint is unprotected** but useless without a valid Booking.com
+   session (irene service requires authentication for all substantive queries).
+ - **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by
+   operation name restriction. Use field validation errors to probe the schema.
+ - **GDPR consent banner**: shown after WAF resolves, before React renders
+   search results. Must be dismissed (click `[data-testid="accept"]`) before
+   interacting with the page in EU sessions. Non-EU IPs may not see it.
+ - **React hydration delay**: `wait_for_load()` fires before card data renders.
+   Always add 2-3s of `wait()` after `wait_for_load()`.
+ - **`sr-hotel` class is legacy**: Booking.com migrated to data-testid
+   attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
+ - **Price parsing**: the price element often contains the full string
+   `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
+   item for the current price.
+ - **Offset pagination cap**: Booking caps results at 1,000 properties per
+   search (offset 0–975, rows=25). For cities with >1,000 properties, use
+   filters (`nflt`) to segment results.
+ - **Currency must be set via URL param**: use `selected_currency=USD` in the
+   search URL; the cookie-based currency selection may not persist across
+   navigation.
+ - **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
+   London = `-2601889`. Negative integers indicate city-level destinations.
+   Get the ID by reading it from the URL after using `ss=` search.
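
The price-parsing, pagination-cap and currency notes above can be sketched as a few small helpers. This is a minimal illustration and not part of the package: the function names `current_price`, `page_offsets` and `search_url` are hypothetical, and the `offset` query-parameter name is an assumption inferred from the "offset 0–975" wording rather than something the notes confirm.

```python
from urllib.parse import urlencode

PAGE_SIZE = 25      # rows per results page, per the pagination note
MAX_RESULTS = 1000  # cap on properties returned per search


def current_price(price_text: str) -> str:
    """Return the last non-empty line of a price element's text.

    With a discount the element holds "original\ncurrent", so the last
    line is the current price; without one there is a single line.
    """
    lines = [line.strip() for line in price_text.split("\n") if line.strip()]
    return lines[-1] if lines else ""


def page_offsets(page_size: int = PAGE_SIZE, cap: int = MAX_RESULTS) -> list[int]:
    """Offsets that stay under the per-search result cap."""
    return list(range(0, cap, page_size))


def search_url(city: str, offset: int, currency: str = "USD") -> str:
    """Build a search URL with the currency pinned via URL parameter.

    The "offset" parameter name is an assumption (see lead-in).
    """
    params = {"ss": city, "offset": offset, "selected_currency": currency}
    return "https://www.booking.com/searchresults.html?" + urlencode(params)
```

For example, `current_price("US$400\nUS$320")` yields `"US$320"`, and `page_offsets()` runs from 0 to 975 in steps of 25. Each generated URL would still need to be loaded in the browser and waited past the WAF, since plain HTTP requests for HTML pages are blocked.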