@pencil-agent/nano-pencil 2.0.0-beta.9 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/extensions-host/index.d.ts +1 -1
- package/dist/core/extensions-host/types.d.ts +5 -8
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/goal/goal-controller.js +1 -1
- package/dist/extensions/builtin/goal/goal-prompts.js +4 -4
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/index.d.ts +5 -30
- package/dist/index.js +1 -1
- package/dist/models.d.ts +7 -0
- package/dist/models.js +1 -0
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/packages/protocol/src/flags.d.ts +20 -0
- package/dist/packages/protocol/src/flags.js +0 -0
- package/dist/packages/protocol/src/hooks.d.ts +17 -0
- package/dist/packages/protocol/src/hooks.js +0 -0
- package/dist/packages/protocol/src/index.d.ts +4 -2
- package/dist/packages/protocol/src/index.js +1 -1
- package/dist/packages/protocol/src/lifecycle.d.ts +11 -21
- package/dist/public-config.d.ts +12 -0
- package/dist/public-config.js +1 -0
- package/dist/runtime.d.ts +9 -0
- package/dist/runtime.js +1 -0
- package/dist/session-compaction.d.ts +7 -0
- package/dist/session-compaction.js +1 -0
- package/dist/session.d.ts +7 -0
- package/dist/session.js +1 -0
- package/dist/skills.d.ts +7 -0
- package/dist/skills.js +1 -0
- package/dist/tools.d.ts +7 -0
- package/dist/tools.js +1 -0
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -162
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md
CHANGED
@@ -1,578 +1,578 @@

# Booking.com: Scraping & Data Extraction

Field-tested against booking.com on 2026-04-18 using `http_get` and the
`dml/graphql` JSON API. All tests run without a browser session.

---

## TL;DR

**`http_get` returns nothing useful from booking.com.** Every HTML page
(search results, hotel pages, city pages, the homepage) is intercepted by an
AWS WAF JS challenge before any content is served. The challenge requires
JavaScript execution to complete a cryptographic puzzle and set an
`aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.

**What you can do without a browser:**
- Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
- Read `robots.txt` for URL pattern documentation.
- Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
  exploration (no auth = internal errors, but validation errors reveal the
  schema).

**For all actual data extraction, use the browser (`goto` + `js`).**

---

## AWS WAF JS Challenge: What It Is

Every `http_get` request to `www.booking.com` receives one of two variants of
a WAF stub:

**Variant A (~3,962 bytes), modern SDK:**
```html
<script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
<script>
  AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
</script>
```

**Variant B (~8,410 bytes), with AJAX error reporting:**
Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
`https://reports.booking.com/chal_report`. This variant is more common on
non-browser UA strings.

**Detection in your code:**
```python
def is_waf_blocked(html: str) -> bool:
    return (
        'AwsWafIntegration' in html
        or 'awsWafCookieDomainList' in html
        or 'challenge.js' in html
        or (len(html) < 10_000 and '<title></title>' in html)
    )
```

**What the challenge does:**
1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
2. Executes a cryptographic proof-of-work puzzle client-side.
3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
   appended.
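
Because the redirect in step 4 appends challenge parameters, a URL read back from the browser may no longer match the one you requested. The following is a minimal sketch, not part of the original notes, that removes those two parameters so URLs can be compared or de-duplicated. The helper name `strip_challenge_params` is hypothetical; the parameter names come from the redirect described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names taken from the redirect described in step 4.
CHALLENGE_PARAMS = {"chal_t", "force_referer"}


def strip_challenge_params(url: str) -> str:
    """Return `url` without the WAF challenge query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in CHALLENGE_PARAMS
    ]
    # Remaining parameters are re-encoded in their original order.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Note that the remaining parameters are re-encoded, so a value such as an `nflt` filter string comes back percent-encoded even if it was not originally.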

This challenge **cannot be solved by `http_get`**. It requires a real JS
engine. A `bkng` session cookie is set on the first blocked response, but it
has no value without the WAF token.

**User agents tested (all blocked):**
- Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
- iPhone/Safari mobile
- `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
- Default `urllib` UA

---

## What `http_get` CAN Access

### 1. XML Sitemaps (URL discovery)

Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
millions of property, city, region, and attraction URLs without a browser.

```python
import gzip, re, urllib.request

GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}

def fetch_sitemap_index(url: str) -> list[str]:
    """Returns list of child sitemap URLs from an index sitemap."""
    xml = http_get(url, headers=GOOGLEBOT)
    return re.findall(r'<loc>(https://[^<]+)</loc>', xml)

def fetch_sitemap_gz(gz_url: str) -> list[str]:
    """Decompresses a gzipped sitemap and returns all <loc> URLs."""
    req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
    with urllib.request.urlopen(req, timeout=30) as r:
        data = gzip.decompress(r.read())
    return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())

# Example: get all en-gb hotel URLs
hotel_idx = http_get(
    "https://www.booking.com/sitembk-hotel-index.xml",
    headers=GOOGLEBOT
)
# 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
en_gb_shards = re.findall(
    r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
    hotel_idx
)
# hotel_urls = fetch_sitemap_gz(en_gb_shards[0])  # ~50K URLs per shard
```

**Available sitemap categories (confirmed, 275 total):**

| Index URL | Content |
|-----------|---------|
| `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
| `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
| `sitembk-region-index.xml` | Region landing pages |
| `sitembk-country-index.xml` | Country landing pages |
| `sitembk-attractions-index.xml` | Attractions |
| `sitembk-hotel-review-index.xml` | Review pages |
| `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
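
The hotel index mixes shards for many locales, so it can help to group them before choosing which to download. This sketch works on index XML that has already been fetched. It assumes every locale follows the `sitembk-hotel-{locale}.{n}.xml.gz` naming that is only confirmed above for `en-gb`; the function name `shards_by_locale` is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: other locales use the same shard naming as en-gb.
SHARD_RE = re.compile(
    r"<loc>(https://www\.booking\.com/sitembk-hotel-"
    r"([a-z]{2}(?:-[a-z]{2})?)\.\d+\.xml\.gz)</loc>"
)


def shards_by_locale(index_xml: str) -> dict[str, list[str]]:
    """Group hotel shard URLs from a sitemap index by locale code."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for url, locale in SHARD_RE.findall(index_xml):
        grouped[locale].append(url)
    return dict(grouped)
```

Given the hotel index text, `shards_by_locale(hotel_idx).get("en-gb", [])` would return the same list as the `en_gb_shards` regex, if the naming assumption holds.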

### 2. `robots.txt`

```python
robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
```

- Returns immediately, no WAF.
- 136 Disallow entries, 275 Sitemap declarations.
- Documents all URL structures (search results, hotel pages, booking flow, etc.).
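
To re-check the Disallow and Sitemap counts quoted above against a fresh copy, the file can be summarized with a few lines of parsing. This is a rough sketch over the already-fetched text, not a full robots.txt parser: it ignores `User-agent` grouping, and the name `summarize_robots` is hypothetical.

```python
def summarize_robots(robots_txt: str) -> dict[str, list[str]]:
    """Collect Disallow paths and Sitemap URLs from robots.txt text."""
    summary: dict[str, list[str]] = {"disallow": [], "sitemap": []}
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")     # split at the first colon only
        key = key.strip().lower()
        value = value.strip()
        if sep and key in summary and value:
            summary[key].append(value)
    return summary
```

With the `robots` string from the snippet above, `len(summarize_robots(robots)["sitemap"])` gives the current number of Sitemap declarations.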

### 3. GraphQL Schema Exploration (no auth)

The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
It accepts POST requests and returns JSON. Without a session, most queries
return `Internal Server Error` from the backend (`irene` service), but
**GraphQL validation errors fire before the backend** and reveal the schema.

```python
import json, urllib.request, gzip

GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
GQL_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Origin": "https://www.booking.com",
    "Referer": "https://www.booking.com/searchresults.html",
    "x-booking-context-action-name": "searchresults",
    "x-booking-context-aid": "376510",
    "x-booking-site-type-id": "1",
}

def gql(operation_name: str, query: str, variables: dict = None) -> dict:
    payload = {"operationName": operation_name, "query": query}
    if variables:
        payload["variables"] = variables
    req = urllib.request.Request(
        GQL_URL,
        data=json.dumps(payload).encode(),
        headers=GQL_HEADERS,
        method="POST"
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        data = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            data = gzip.decompress(data)
    return json.loads(data.decode())
```

**Confirmed Query type fields (schema, field-tested 2026-04-18):**

| Field | Input type | Notes |
|-------|-----------|-------|
| `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
| `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
| `searchProperties` | `SearchInput!` | Returns 500 without auth session |
| `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
| `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |

**Important:** Booking.com GraphQL uses an **operation name whitelist** for
some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
following confirmed working names: `SearchResultsPage`, `SearchQuery`,
`HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.

**Operation names that bypass the whitelist restriction** (each returns
`{ data: { __typename: 'Query' } }` when sent the query `{ __typename }`):
- `SearchResultsPage` ✓ (confirmed, use this)
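
The whitelist fallback described above can be sketched as a small probe loop. This is an illustration under assumptions, not a documented client: the names `is_unknown_operation` and `first_accepted_name` are hypothetical, and the check simply searches the serialized response for the error code because the exact position of the code in the error payload is not recorded here. The `send` argument is any callable taking `(operation_name, query)` and returning the decoded JSON, such as the `gql` helper from this section.

```python
import json
from typing import Callable, Optional

# Names listed above as confirmed working.
CANDIDATE_NAMES = [
    "SearchResultsPage", "SearchQuery", "HotelCardsList",
    "SearchResultsList", "PropertySearch", "BookingSearch",
]


def is_unknown_operation(response: dict) -> bool:
    # Assumption: the error code appears somewhere in the JSON body.
    return "GRAPHQL_UNKNOWN_OPERATION_NAME" in json.dumps(response)


def first_accepted_name(send: Callable[[str, str], dict]) -> Optional[str]:
    """Probe each candidate with `{ __typename }`; return the first accepted name."""
    for name in CANDIDATE_NAMES:
        response = send(name, f"query {name} {{ __typename }}")
        if not is_unknown_operation(response):
            return name
    return None
```

Passing the transport in as a parameter keeps the probe logic testable without network access.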

**The search query structure** (known but returns 500 without session):
```graphql
query SearchResultsPage($input: SearchQueryInput!) {
  searchQueries {
    search(input: $input) {
      __typename  # Returns SearchQueryResult type
    }
  }
}
```

With `SearchQueryInput` fields (inferred from URL parameters, confirmed
accepted by validation):
```json
{
  "dest_id": "-1456928",
  "dest_type": "CITY",
  "checkin": "2026-05-01",
  "checkout": "2026-05-03",
  "group_adults": "2",
  "no_rooms": "1",
  "group_children": "0",
  "selected_currency": "USD"
}
```
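
Putting the query and the input fields together, the request body might be assembled as below. This is a sketch only: the function name `build_search_payload` and its defaults are hypothetical, the values are sent as strings to mirror the JSON example above, and, as already noted, the server answers with a 500 unless a session is present.

```python
import json

SEARCH_QUERY = """
query SearchResultsPage($input: SearchQueryInput!) {
  searchQueries {
    search(input: $input) {
      __typename
    }
  }
}
"""


def build_search_payload(
    dest_id: str,
    checkin: str,
    checkout: str,
    adults: int = 2,
    rooms: int = 1,
    children: int = 0,
    currency: str = "USD",
    dest_type: str = "CITY",
) -> dict:
    """Build the POST body for the SearchResultsPage operation."""
    search_input = {
        "dest_id": dest_id,
        "dest_type": dest_type,
        "checkin": checkin,
        "checkout": checkout,
        # Counts are stringified to match the example input above.
        "group_adults": str(adults),
        "no_rooms": str(rooms),
        "group_children": str(children),
        "selected_currency": currency,
    }
    return {
        "operationName": "SearchResultsPage",
        "query": SEARCH_QUERY,
        "variables": {"input": search_input},
    }


body = json.dumps(build_search_payload("-1456928", "2026-05-01", "2026-05-03"))
```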

---

## URL Parameter Reference

### Search Results
`https://www.booking.com/searchresults.html`

| Parameter | Type | Example | Notes |
|-----------|------|---------|-------|
| `ss` | string | `Paris` | Free-text: city, hotel name, address |
| `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
| `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
| `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
| `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
| `group_adults` | int | `2` | |
| `no_rooms` | int | `1` | |
| `group_children` | int | `0` | |
| `age` | int (repeatable) | `5` | Child age; one per child |
| `selected_currency` | string | `USD` | ISO 4217 currency code |
| `lang` | string | `en-us` | BCP 47 locale |
| `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
| `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
| `offset` | int | `25` | Pagination (0-based, step 25) |
| `rows` | int | `25` | Results per page (max 25) |
| `map` | `1` | `1` | Map view mode |
| `src` | string | `searchresults` | Source context (cosmetic) |

**Common `nflt` filter codes:**
- `ht_id=204`: Hotels only
- `class=3;class=4;class=5`: Star rating
- `review_score=90`: Guest rating ≥ 9.0
- `fc=2`: Free cancellation
- `rm_types=…`: Room type
- `pri=1;pri=2`: Price tier (budget / mid / upscale)
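
The parameters in the table above can be combined into a search URL with the standard library. This is a minimal sketch: `build_search_url` is a hypothetical helper, it covers only a subset of the parameters, and it percent-encodes the `nflt` value (`;` and `=` become `%3B` and `%3D`), which is assumed to be accepted in the same way as the raw form.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.booking.com/searchresults.html"


def build_search_url(
    ss: str,
    checkin: str,
    checkout: str,
    *,
    adults: int = 2,
    rooms: int = 1,
    children_ages: tuple = (),
    filters: tuple = (),
    order: str = "",
    offset: int = 0,
    currency: str = "USD",
) -> str:
    """Assemble a searchresults.html URL from the documented parameters."""
    params = [
        ("ss", ss),
        ("checkin", checkin),
        ("checkout", checkout),
        ("group_adults", adults),
        ("no_rooms", rooms),
        ("group_children", len(children_ages)),
    ]
    params += [("age", age) for age in children_ages]  # one `age` per child
    params.append(("selected_currency", currency))
    if filters:
        params.append(("nflt", ";".join(filters)))  # semicolon-separated
    if order:
        params.append(("order", order))
    if offset:
        params.append(("offset", offset))  # multiples of 25
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, `build_search_url("Paris", "2026-05-01", "2026-05-03", filters=("ht_id=204", "class=4"), order="price")` yields a hotels-only, four-star search sorted by price.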

### Property Pages
`https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`

Confirmed from sitemap (74 shards, ~3.5M properties):
```
https://www.booking.com/hotel/{cc}/{slug}.html
https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
```
- `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
- `slug` = hotel name, lowercase, hyphen-separated
- Locale suffix optional; omit for default (English)
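
URLs harvested from the sitemaps can be split back into these components with a regular expression. The sketch below assumes that slugs never contain a dot and that locale suffixes look like `xx` or `xx-yy`; neither is confirmed beyond the patterns listed above. `HotelRef`, `parse_hotel_url`, and the example slug are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class HotelRef(NamedTuple):
    cc: str
    slug: str
    lang: Optional[str]  # None when the URL has no locale suffix


# Assumptions: slugs contain no dots; locales are `xx` or `xx-yy`.
HOTEL_URL_RE = re.compile(
    r"^https://www\.booking\.com/hotel/(?P<cc>[a-z]{2})/(?P<slug>[^/.?#]+)"
    r"(?:\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?))?\.html"
)


def parse_hotel_url(url: str) -> Optional[HotelRef]:
    """Split a property URL into country code, slug, and optional locale."""
    match = HOTEL_URL_RE.match(url)
    if match is None:
        return None
    return HotelRef(match["cc"], match["slug"], match["lang"])


ref = parse_hotel_url("https://www.booking.com/hotel/fr/example-hotel.en-gb.html")
```

Keying on `(cc, slug)` gives a simple way to de-duplicate the same property across locale shards.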

### City / Region / Country Pages
```
https://www.booking.com/city/{cc}/{city-slug}.html
https://www.booking.com/region/{cc}/{region-slug}.html
https://www.booking.com/country/{cc}.html
```

---

## Browser-Based Extraction (Required for All Data)

Since `http_get` is blocked, all actual data extraction requires the browser
(`goto` + `js`). The WAF challenge resolves automatically in Chrome.

### Initial Navigation

```python
# Always use new_tab() for the first Booking.com load in a session
tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
wait_for_load()
wait(3)  # React hydration takes ~3s after readyState=complete

# Check for WAF challenge still running (rare in real Chrome)
url = page_info()["url"]
if "chal_t=" in url:
    wait(5)  # WAF challenge resolving
    wait_for_load()
```

### GDPR / Cookie Consent Banner (EU Visitors)

Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
the WAF challenge resolves. It blocks interaction until dismissed.

```python
def dismiss_cookie_banner():
    # Booking.com uses data-testid="accept" on the Accept button
    accepted = js("""
    (function() {
        var btn = document.querySelector('[data-testid="accept"]')
            || document.querySelector('#onetrust-accept-btn-handler')
            || document.querySelector('[aria-label*="Accept"]');
        if (btn) { btn.click(); return true; }
        return false;
    })()
    """)
    return accepted

# Call immediately after load if you have an EU IP
if dismiss_cookie_banner():
    wait(1)
```

The consent banner does **not** appear in the WAF stub; it only renders after
the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
may not see it at all.

### Search Results Page Extraction

```python
results = js("""
Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
    name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
    url: el.querySelector('[data-testid="title-link"]')?.href,
    price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
    rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
    stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
    location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
    availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
    is_genius: !!el.querySelector('[data-testid="genius-label"]'),
}))
""")
```
|
|
339
|
-
|
|
340
|
-
**Field notes:**
|
|
341
|
-
- `data-testid="property-card"` — confirmed selector for result cards (as of
|
|
342
|
-
2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
|
|
343
|
-
- `data-testid="price-and-discounted-price"` — contains the nightly rate;
|
|
344
|
-
may show original + discounted price together as text.
|
|
345
|
-
- `data-testid="review-score"` — contains both the numeric score (e.g.,
|
|
346
|
-
`"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
|
|
347
|
-
- `data-testid="rating-stars"` — star rating icons; count SVG children for
|
|
348
|
-
star count.
|
|
349
|
-
- Results are loaded asynchronously; 3s wait after `wait_for_load()` is
|
|
350
|
-
required for all cards to render.
|
|
351
|
-
|
|
352
|
-
### Pagination
|
|
353
|
-
|
|
354
|
-
```python
|
|
355
|
-
# Method 1: Next page button
|
|
356
|
-
next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
|
|
357
|
-
if next_btn:
|
|
358
|
-
    goto_url(next_btn)
|
|
359
|
-
    wait_for_load()
|
|
360
|
-
    wait(3)
|
|
361
|
-
|
|
362
|
-
# Method 2: Offset parameter (25 results per page)
|
|
363
|
-
current_url = page_info()["url"]
|
|
364
|
-
offset = 25 # next page
|
|
365
|
-
goto_url(current_url + f"&offset={offset}")  # assumes current_url has a query string and no existing offset param
|
|
366
|
-
wait_for_load()
|
|
367
|
-
wait(3)
|
|
368
|
-
```
|
|
369
|
-
|
|
370
|
-
### Property / Hotel Page Extraction
|
|
371
|
-
|
|
372
|
-
```python
|
|
373
|
-
detail = js("""
|
|
374
|
-
({
|
|
375
|
-
name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
|
|
376
|
-
|| document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
|
|
377
|
-
rating: document.querySelector('[data-testid="rating-squares"]')
|
|
378
|
-
? document.querySelectorAll('[data-testid="rating-squares"] svg').length
|
|
379
|
-
: null,
|
|
380
|
-
score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
|
|
381
|
-
|| document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
|
|
382
|
-
address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
|
|
383
|
-
|| document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
|
|
384
|
-
description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
|
|
385
|
-
|| document.querySelector('#property_description_content')?.innerText?.trim(),
|
|
386
|
-
amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
|
|
387
|
-
.map(e => e.innerText?.trim()).filter(Boolean),
|
|
388
|
-
room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
|
|
389
|
-
.map(el => ({
|
|
390
|
-
name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
|
|
391
|
-
price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
|
|
392
|
-
})),
|
|
393
|
-
lat: document.querySelector('a[href*="maps.google"]')
|
|
394
|
-
?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
|
|
395
|
-
lon: document.querySelector('a[href*="maps.google"]')
|
|
396
|
-
?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
|
|
397
|
-
})
|
|
398
|
-
""")
|
|
399
|
-
```
|
|
400
|
-
|
|
401
|
-
### JSON-LD Schema (Property Pages)
|
|
402
|
-
|
|
403
|
-
Property pages embed JSON-LD when fully rendered in a browser. The schema type
|
|
404
|
-
is `Hotel`:
|
|
405
|
-
|
|
406
|
-
```python
|
|
407
|
-
ld_json = js("""
|
|
408
|
-
(function() {
|
|
409
|
-
for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
|
|
410
|
-
try {
|
|
411
|
-
var d = JSON.parse(s.textContent);
|
|
412
|
-
if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
|
|
413
|
-
} catch(e) {}
|
|
414
|
-
}
|
|
415
|
-
return null;
|
|
416
|
-
})()
|
|
417
|
-
""")
|
|
418
|
-
# Returns:
|
|
419
|
-
# {
|
|
420
|
-
# "@type": "Hotel",
|
|
421
|
-
# "name": "Hotel de Crillon",
|
|
422
|
-
# "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
|
|
423
|
-
# "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
|
|
424
|
-
# "geo": {"latitude": 48.865, "longitude": 2.321},
|
|
425
|
-
# "starRating": {"ratingValue": 5}
|
|
426
|
-
# }
|
|
427
|
-
```
|
|
428
|
-
|
|
429
|
-
JSON-LD is **not present in the WAF stub** — it only exists in the fully
|
|
430
|
-
rendered page. `http_get` will never see it.
|
|
431
|
-
|
|
432
|
-
### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)
|
|
433
|
-
|
|
434
|
-
Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
|
|
435
|
-
legacy `b_hotel_data` globals. Access via:
|
|
436
|
-
|
|
437
|
-
```python
|
|
438
|
-
next_data = js("window.__NEXT_DATA__") # dict or None
|
|
439
|
-
b_hotel = js("window.b_hotel_data") # dict or None — legacy pages
|
|
440
|
-
```
|
|
441
|
-
|
|
442
|
-
These globals are not present in the WAF stub and their availability depends
|
|
443
|
-
on page version. Prefer `data-testid` selectors, which are more stable.
|
|
444
|
-
|
|
445
|
-
---
|
|
446
|
-
|
|
447
|
-
## Pricing Extraction Patterns
|
|
448
|
-
|
|
449
|
-
Booking.com shows prices per night with multiple formatting variants:
|
|
450
|
-
|
|
451
|
-
```python
|
|
452
|
-
price_patterns = js("""
|
|
453
|
-
({
|
|
454
|
-
// Search results card price
|
|
455
|
-
search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
|
|
456
|
-
// Property page room price (same data-testid as the search card; returns the first match on the page)
|
|
457
|
-
room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
|
|
458
|
-
// Original (crossed-out) price before discount
|
|
459
|
-
original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
|
|
460
|
-
|| document.querySelector('.prco-valign-middle-helper del')?.innerText,
|
|
461
|
-
// "Price for X nights" summary
|
|
462
|
-
total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
|
|
463
|
-
// Genius discount tag
|
|
464
|
-
genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
|
|
465
|
-
})
|
|
466
|
-
""")
|
|
467
|
-
```
|
|
468
|
-
|
|
469
|
-
**Price display nuances:**
|
|
470
|
-
- Prices shown are **per night** by default; multiply by nights for total.
|
|
471
|
-
- Currency is controlled by `selected_currency` URL param or user account
|
|
472
|
-
setting.
|
|
473
|
-
- Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
|
|
474
|
-
or `"+ taxes & fees"` text adjacent to the price element.
|
|
475
|
-
- The `data-testid="price-and-discounted-price"` element returns a single
|
|
476
|
-
string that may contain both original and discounted price
|
|
477
|
-
(e.g., `"US$400\nUS$320"`).
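
The combined string described in the last bullet can be split in Python after extraction. This is a minimal sketch, assuming US-style number formatting (comma thousands separator, dot decimal point); `parse_price_text` is a hypothetical helper, not part of the agent helpers:

```python
import re
from typing import Optional

def parse_price_text(text: str) -> dict:
    """Split a price-and-discounted-price string into original and current amounts."""
    def amount(s: str) -> Optional[float]:
        # Assumption: comma thousands separators and a dot decimal point.
        m = re.search(r"\d[\d,]*(?:\.\d+)?", s)
        return float(m.group().replace(",", "")) if m else None

    parts = [p.strip() for p in text.split("\n") if p.strip()]
    if not parts:
        return {"current": None, "original": None}
    return {
        "current": amount(parts[-1]),  # last line is the price currently offered
        "original": amount(parts[0]) if len(parts) > 1 else None,
    }
```

For `"US$400\nUS$320"` this yields a current price of `320.0` and an original price of `400.0`. Other currency formats (for example, a dot as thousands separator) would need a different pattern.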
|
|
478
|
-
|
|
479
|
-
---
|
|
480
|
-
|
|
481
|
-
## WAF Detection & Handling in Browser
|
|
482
|
-
|
|
483
|
-
The WAF resolves automatically in a real Chrome session. To detect if
|
|
484
|
-
something went wrong:
|
|
485
|
-
|
|
486
|
-
```python
|
|
487
|
-
def check_booking_waf():
|
|
488
|
-
    url = page_info()["url"]
|
|
489
|
-
    html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
|
|
490
|
-
    return (
|
|
491
|
-
        "chal_t=" in url
|
|
492
|
-
        or "AwsWafIntegration" in html_snippet
|
|
493
|
-
        or "challenge-container" in html_snippet
|
|
494
|
-
    )
|
|
495
|
-
|
|
496
|
-
def wait_past_waf(timeout=15):
|
|
497
|
-
    import time
|
|
498
|
-
    deadline = time.time() + timeout
|
|
499
|
-
    while time.time() < deadline:
|
|
500
|
-
        if not check_booking_waf():
|
|
501
|
-
            return True
|
|
502
|
-
        wait(1)
|
|
503
|
-
    return False  # timed out: WAF didn't resolve
|
|
504
|
-
|
|
505
|
-
# Use after goto_url():
|
|
506
|
-
goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
|
|
507
|
-
wait_for_load()
|
|
508
|
-
wait_past_waf()
|
|
509
|
-
wait(2) # hydration
|
|
510
|
-
```
|
|
511
|
-
|
|
512
|
-
---
|
|
513
|
-
|
|
514
|
-
## Sitemap-Based URL Discovery Workflow
|
|
515
|
-
|
|
516
|
-
Use this when you need a list of property URLs for a given country or city,
|
|
517
|
-
without needing to scrape search results pages in the browser:
|
|
518
|
-
|
|
519
|
-
```python
|
|
520
|
-
import gzip, re, urllib.request
|
|
521
|
-
|
|
522
|
-
GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
|
|
523
|
-
|
|
524
|
-
def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
|
|
525
|
-
    """Returns property page URLs for a country from sitemaps. No browser needed."""
|
|
526
|
-
    idx_url = "https://www.booking.com/sitembk-hotel-index.xml"
|
|
527
|
-
    idx = http_get(idx_url, headers=GOOGLEBOT)
|
|
528
|
-
    pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{lang}\.\d+\.xml\.gz)</loc>'
|
|
529
|
-
    shards = re.findall(pattern, idx)[:max_shards]
|
|
530
|
-
|
|
531
|
-
    urls = []
|
|
532
|
-
    for shard_url in shards:
|
|
533
|
-
        req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
|
|
534
|
-
        with urllib.request.urlopen(req, timeout=60) as r:
|
|
535
|
-
            xml = gzip.decompress(r.read()).decode()
|
|
536
|
-
        all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
|
|
537
|
-
        # Filter by country code
|
|
538
|
-
        country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
|
|
539
|
-
        urls.extend(country_urls)
|
|
540
|
-
    return urls
|
|
541
|
-
|
|
542
|
-
# Example: get French hotel URLs (no browser needed, instant)
|
|
543
|
-
# french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
|
|
544
|
-
# len(french_hotels) -> ~8,000+ URLs from one shard
|
|
545
|
-
```
|
|
546
|
-
|
|
547
|
-
---
|
|
548
|
-
|
|
549
|
-
## Gotchas
|
|
550
|
-
|
|
551
|
-
- **WAF blocks everything via `http_get`** — there is no User-Agent or header
|
|
552
|
-
combination that bypasses it. The challenge is cryptographic, not heuristic.
|
|
553
|
-
- **WAF has two page sizes** — ~3,962 bytes (newer SDK, no AJAX reporter) and
|
|
554
|
-
~8,410 bytes (older with error reporting). Both are equally blocked.
|
|
555
|
-
- **Sitemaps whitelist Googlebot UA** — `Googlebot/2.1` UA works for sitemap
|
|
556
|
-
XML/GZ files but NOT for hotel/city/search HTML pages.
|
|
557
|
-
- **GraphQL endpoint is unprotected** but useless without a valid Booking.com
|
|
558
|
-
session (irene service requires authentication for all substantive queries).
|
|
559
|
-
- **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by
|
|
560
|
-
operation name restriction. Use field validation errors to probe the schema.
|
|
561
|
-
- **GDPR consent banner**: shown after WAF resolves, before React renders
|
|
562
|
-
search results. Must be dismissed (click `[data-testid="accept"]`) before
|
|
563
|
-
interacting with EU sessions. Non-EU IPs may not see it.
|
|
564
|
-
- **React hydration delay**: `wait_for_load()` fires before card data renders.
|
|
565
|
-
Always add 2-3s of `wait()` after `wait_for_load()`.
|
|
566
|
-
- **`sr-hotel` class is legacy** — Booking.com migrated to data-testid
|
|
567
|
-
attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
|
|
568
|
-
- **Price parsing**: the price element often contains the full string
|
|
569
|
-
`"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
|
|
570
|
-
item for current price.
|
|
571
|
-
- **Offset pagination cap**: Booking caps results at 1,000 properties per
|
|
572
|
-
search (offset 0–975, rows=25). For cities with >1,000 properties, use
|
|
573
|
-
filters (`nflt`) to segment results.
|
|
574
|
-
- **Currency must be set via URL param**: `selected_currency=USD` in the search
|
|
575
|
-
URL; the cookie-based currency selection may not persist across navigation.
|
|
576
|
-
- **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
|
|
577
|
-
London = `-2601889`. Negative integers indicate city-level destinations.
|
|
578
|
-
Get the ID by reading it from the URL after using `ss=` search.
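
Reading the destination ID back from the URL, as the last gotcha suggests, can be done with the standard library. This is a minimal sketch, assuming the resolved search URL carries `dest_id` and `dest_type` as query parameters; `extract_dest` is a hypothetical helper:

```python
from urllib.parse import parse_qs, urlsplit

def extract_dest(url: str) -> dict:
    """Read dest_id and dest_type from a search results URL, if present."""
    qs = parse_qs(urlsplit(url).query)
    return {
        "dest_id": qs.get("dest_id", [None])[0],      # None when the param is absent
        "dest_type": qs.get("dest_type", [None])[0],
    }
```

In a browser session this would be called as `extract_dest(page_info()["url"])` after the `ss=` search has loaded.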
|
|
1
|
+
# Booking.com — Scraping & Data Extraction
|
|
2
|
+
|
|
3
|
+
Field-tested against booking.com on 2026-04-18 using `http_get` and the
|
|
4
|
+
`dml/graphql` JSON API. All tests were run without a browser session.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## TL;DR
|
|
9
|
+
|
|
10
|
+
**`http_get` returns nothing useful from booking.com.** Every HTML page —
|
|
11
|
+
search results, hotel pages, city pages, the homepage — is intercepted by an
|
|
12
|
+
AWS WAF JS challenge before any content is served. The challenge requires
|
|
13
|
+
JavaScript execution to complete a cryptographic puzzle and set an
|
|
14
|
+
`aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page.
|
|
15
|
+
|
|
16
|
+
**What you can do without a browser:**
|
|
17
|
+
- Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required).
|
|
18
|
+
- Read `robots.txt` for URL pattern documentation.
|
|
19
|
+
- Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema
|
|
20
|
+
exploration (no auth = internal errors, but validation errors reveal the
|
|
21
|
+
schema).
|
|
22
|
+
|
|
23
|
+
**For all actual data extraction, use the browser (`goto_url` + `js`).**
|
|
24
|
+
|
|
25
|
+
---
|
|
26
|
+
|
|
27
|
+
## AWS WAF JS Challenge — What It Is
|
|
28
|
+
|
|
29
|
+
Every `http_get` request to `www.booking.com` receives one of two variants of
|
|
30
|
+
a WAF stub:
|
|
31
|
+
|
|
32
|
+
**Variant A (~3,962 bytes) — modern SDK:**
|
|
33
|
+
```html
|
|
34
|
+
<script src="https://www.booking.com/__challenge_{KEY}/{HASH}/challenge.js"></script>
|
|
35
|
+
<script>
|
|
36
|
+
AwsWafIntegration.getToken().then(() => { window.location.href = newHref; });
|
|
37
|
+
</script>
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
**Variant B (~8,410 bytes) — with AJAX error reporting:**
|
|
41
|
+
Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to
|
|
42
|
+
`https://reports.booking.com/chal_report`. This variant is more common on
|
|
43
|
+
non-browser UA strings.
|
|
44
|
+
|
|
45
|
+
**Detection in your code:**
|
|
46
|
+
```python
|
|
47
|
+
def is_waf_blocked(html: str) -> bool:
|
|
48
|
+
    return (
|
|
49
|
+
        'AwsWafIntegration' in html
|
|
50
|
+
        or 'awsWafCookieDomainList' in html
|
|
51
|
+
        or 'challenge.js' in html
|
|
52
|
+
        or (len(html) < 10_000 and '<title></title>' in html)
|
|
53
|
+
    )
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
**What the challenge does:**
|
|
57
|
+
1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL.
|
|
58
|
+
2. Executes a cryptographic proof-of-work puzzle client-side.
|
|
59
|
+
3. Sets an `aws-waf-token` cookie on the `booking.com` domain.
|
|
60
|
+
4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=`
|
|
61
|
+
appended.
|
|
62
|
+
|
|
63
|
+
This challenge **cannot be solved by `http_get`**. It requires a real JS
|
|
64
|
+
engine. A `bkng` session cookie is set on the first blocked response, but it
|
|
65
|
+
has no value without the WAF token.
|
|
66
|
+
|
|
67
|
+
**User agents tested — all blocked:**
|
|
68
|
+
- Chrome desktop (`Mozilla/5.0 ... Chrome/120`)
|
|
69
|
+
- iPhone/Safari mobile
|
|
70
|
+
- `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted)
|
|
71
|
+
- Default `urllib` UA
|
|
72
|
+
|
|
73
|
+
---
|
|
74
|
+
|
|
75
|
+
## What `http_get` CAN Access
|
|
76
|
+
|
|
77
|
+
### 1. XML Sitemaps (URL discovery)
|
|
78
|
+
|
|
79
|
+
Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate
|
|
80
|
+
millions of property, city, region, and attraction URLs without a browser.
|
|
81
|
+
|
|
82
|
+
```python
|
|
83
|
+
import gzip, re, urllib.request
|
|
84
|
+
|
|
85
|
+
GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
|
|
86
|
+
|
|
87
|
+
def fetch_sitemap_index(url: str) -> list[str]:
|
|
88
|
+
    """Returns list of child sitemap URLs from an index sitemap."""
|
|
89
|
+
    xml = http_get(url, headers=GOOGLEBOT)
|
|
90
|
+
    return re.findall(r'<loc>(https://[^<]+)</loc>', xml)
|
|
91
|
+
|
|
92
|
+
def fetch_sitemap_gz(gz_url: str) -> list[str]:
|
|
93
|
+
    """Decompresses a gzipped sitemap and returns all <loc> URLs."""
|
|
94
|
+
    req = urllib.request.Request(gz_url, headers=GOOGLEBOT)
|
|
95
|
+
    with urllib.request.urlopen(req, timeout=30) as r:
|
|
96
|
+
        data = gzip.decompress(r.read())
|
|
97
|
+
    return re.findall(r'<loc>(https://[^<]+)</loc>', data.decode())
|
|
98
|
+
|
|
99
|
+
# Example: get all en-gb hotel URLs
|
|
100
|
+
hotel_idx = http_get(
|
|
101
|
+
"https://www.booking.com/sitembk-hotel-index.xml",
|
|
102
|
+
headers=GOOGLEBOT
|
|
103
|
+
)
|
|
104
|
+
# 74 shards for en-gb; each shard has ~45,000-50,000 property URLs
|
|
105
|
+
en_gb_shards = re.findall(
|
|
106
|
+
r'<loc>(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)</loc>',
|
|
107
|
+
hotel_idx
|
|
108
|
+
)
|
|
109
|
+
# hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard
|
|
110
|
+
```
|
|
111
|
+
|
|
112
|
+
**Available sitemap categories (confirmed, 275 total):**
|
|
113
|
+
|
|
114
|
+
| Index URL | Content |
|
|
115
|
+
|-----------|---------|
|
|
116
|
+
| `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) |
|
|
117
|
+
| `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) |
|
|
118
|
+
| `sitembk-region-index.xml` | Region landing pages |
|
|
119
|
+
| `sitembk-country-index.xml` | Country landing pages |
|
|
120
|
+
| `sitembk-attractions-index.xml` | Attractions |
|
|
121
|
+
| `sitembk-hotel-review-index.xml` | Review pages |
|
|
122
|
+
| `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) |
|
|
123
|
+
|
|
124
|
+
### 2. `robots.txt`
|
|
125
|
+
|
|
126
|
+
```python
|
|
127
|
+
robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"})
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
- Returns immediately, no WAF.
|
|
131
|
+
- 136 Disallow entries, 275 Sitemap declarations.
|
|
132
|
+
- Documents all URL structures (search results, hotel pages, booking flow, etc.).
|
|
133
|
+
|
|
134
|
+
### 3. GraphQL Schema Exploration (no auth)
|
|
135
|
+
|
|
136
|
+
The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**.
|
|
137
|
+
It accepts POST requests and returns JSON. Without a session, most queries
|
|
138
|
+
return `Internal Server Error` from the backend (`irene` service), but
|
|
139
|
+
**GraphQL validation errors fire before the backend** and reveal the schema.
|
|
140
|
+
|
|
141
|
+
```python
|
|
142
|
+
import json, urllib.request, gzip
|
|
143
|
+
|
|
144
|
+
GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb"
|
|
145
|
+
GQL_HEADERS = {
|
|
146
|
+
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
|
|
147
|
+
"Accept": "application/json",
|
|
148
|
+
"Content-Type": "application/json",
|
|
149
|
+
"Origin": "https://www.booking.com",
|
|
150
|
+
"Referer": "https://www.booking.com/searchresults.html",
|
|
151
|
+
"x-booking-context-action-name": "searchresults",
|
|
152
|
+
"x-booking-context-aid": "376510",
|
|
153
|
+
"x-booking-site-type-id": "1",
|
|
154
|
+
}
|
|
155
|
+
|
|
156
|
+
def gql(operation_name: str, query: str, variables: dict = None) -> dict:
|
|
157
|
+
    payload = {"operationName": operation_name, "query": query}
|
|
158
|
+
    if variables:
|
|
159
|
+
        payload["variables"] = variables
|
|
160
|
+
    req = urllib.request.Request(
|
|
161
|
+
        GQL_URL,
|
|
162
|
+
        data=json.dumps(payload).encode(),
|
|
163
|
+
        headers=GQL_HEADERS,
|
|
164
|
+
        method="POST"
|
|
165
|
+
    )
|
|
166
|
+
    with urllib.request.urlopen(req, timeout=20) as r:
|
|
167
|
+
        data = r.read()
|
|
168
|
+
        if r.headers.get("Content-Encoding") == "gzip":
|
|
169
|
+
            data = gzip.decompress(data)
|
|
170
|
+
    return json.loads(data.decode())
|
|
171
|
+
```
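
Since validation errors are the only useful signal without a session, it helps to pull them out of the response. This is a minimal sketch, assuming the response follows the GraphQL specification's top-level `errors` list with a `message` per entry; `validation_messages` is a hypothetical helper and the sample response is illustrative, not captured from Booking.com:

```python
def validation_messages(response: dict) -> list:
    """Collect error messages from a GraphQL response dict."""
    errors = response.get("errors") or []
    return [e.get("message", "") for e in errors if isinstance(e, dict)]

# Illustrative response shape only (not real Booking.com output)
sample = {"data": None, "errors": [{"message": "illustrative validation error"}]}
messages = validation_messages(sample)
```

Applied to the dict returned by `gql(...)`, this gives a flat list of messages to inspect when probing field and input type names.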
|
|
172
|
+
|
|
173
|
+
**Confirmed Query type fields (schema, field-tested 2026-04-18):**
|
|
174
|
+
|
|
175
|
+
| Field | Input type | Notes |
|
|
176
|
+
|-------|-----------|-------|
|
|
177
|
+
| `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` |
|
|
178
|
+
| `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state |
|
|
179
|
+
| `searchProperties` | `SearchInput!` | Returns 500 without auth session |
|
|
180
|
+
| `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session |
|
|
181
|
+
| `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) |
|
|
182
|
+
|
|
183
|
+
**Important:** Booking.com GraphQL uses an **operation name whitelist** for
|
|
184
|
+
some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the
|
|
185
|
+
following confirmed working names: `SearchResultsPage`, `SearchQuery`,
|
|
186
|
+
`HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`.
|
|
187
|
+
|
|
188
|
+
**Operation names that bypass the whitelist restriction** (all return
|
|
189
|
+
`{ data: { __typename: 'Query' } }` with `{ __typename }`):
|
|
190
|
+
- `SearchResultsPage` ✓ (confirmed, use this)
|
|
191
|
+
|
|
192
|
+
**The search query structure** (known but returns 500 without session):
|
|
193
|
+
```graphql
|
|
194
|
+
query SearchResultsPage($input: SearchQueryInput!) {
|
|
195
|
+
searchQueries {
|
|
196
|
+
search(input: $input) {
|
|
197
|
+
__typename # Returns SearchQueryResult type
|
|
198
|
+
}
|
|
199
|
+
}
|
|
200
|
+
}
|
|
201
|
+
```
|
|
202
|
+
|
|
203
|
+
With `SearchQueryInput` fields (inferred from URL parameters, confirmed
|
|
204
|
+
accepted by validation):
|
|
205
|
+
```json
|
|
206
|
+
{
|
|
207
|
+
"dest_id": "-1456928",
|
|
208
|
+
"dest_type": "CITY",
|
|
209
|
+
"checkin": "2026-05-01",
|
|
210
|
+
"checkout": "2026-05-03",
|
|
211
|
+
"group_adults": "2",
|
|
212
|
+
"no_rooms": "1",
|
|
213
|
+
"group_children": "0",
|
|
214
|
+
"selected_currency": "USD"
|
|
215
|
+
}
|
|
216
|
+
```
|
|
217
|
+
|
|
218
|
+
---
|
|
219
|
+
|
|
220
|
+
## URL Parameter Reference
|
|
221
|
+
|
|
222
|
+
### Search Results
|
|
223
|
+
`https://www.booking.com/searchresults.html`
|
|
224
|
+
|
|
225
|
+
| Parameter | Type | Example | Notes |
|
|
226
|
+
|-----------|------|---------|-------|
|
|
227
|
+
| `ss` | string | `Paris` | Free-text: city, hotel name, address |
|
|
228
|
+
| `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) |
|
|
229
|
+
| `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` |
|
|
230
|
+
| `checkin` | `YYYY-MM-DD` | `2026-05-01` | |
|
|
231
|
+
| `checkout` | `YYYY-MM-DD` | `2026-05-03` | |
|
|
232
|
+
| `group_adults` | int | `2` | |
|
|
233
|
+
| `no_rooms` | int | `1` | |
|
|
234
|
+
| `group_children` | int | `0` | |
|
|
235
|
+
| `age` | int (repeatable) | `5` | Child age; one per child |
|
|
236
|
+
| `selected_currency` | string | `USD` | ISO 4217 currency code |
|
|
237
|
+
| `lang` | string | `en-us` | BCP 47 locale |
|
|
238
|
+
| `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters |
|
|
239
|
+
| `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` |
|
|
240
|
+
| `offset` | int | `25` | Pagination (0-based, step 25) |
|
|
241
|
+
| `rows` | int | `25` | Results per page (max 25) |
|
|
242
|
+
| `map` | `1` | `1` | Map view mode |
|
|
243
|
+
| `src` | string | `searchresults` | Source context (cosmetic) |
|
|
244
|
+
|
|
245
|
+
**Common `nflt` filter codes:**
|
|
246
|
+
- `ht_id=204` — Hotels only
|
|
247
|
+
- `class=3;class=4;class=5` — Star rating
|
|
248
|
+
- `review_score=90` — Guest rating ≥ 9.0
|
|
249
|
+
- `fc=2` — Free cancellation
|
|
250
|
+
- `rm_types=…` — Room type
|
|
251
|
+
- `pri=1;pri=2` — Price tier (budget / mid / upscale)
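
The parameters above can be assembled with the standard library rather than by string concatenation. This is a minimal sketch that uses only parameter names from the table; `build_search_url` and its argument names are hypothetical, and it assumes Booking.com accepts a percent-encoded `nflt` value:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.booking.com/searchresults.html"

def build_search_url(ss, checkin, checkout, adults=2, rooms=1,
                     currency="USD", nflt=None, offset=0):
    """Build a search results URL from the documented query parameters."""
    params = {
        "ss": ss,
        "checkin": checkin,          # YYYY-MM-DD
        "checkout": checkout,        # YYYY-MM-DD
        "group_adults": adults,
        "no_rooms": rooms,
        "selected_currency": currency,
    }
    if nflt:
        params["nflt"] = ";".join(nflt)  # filters are semicolon-separated
    if offset:
        params["offset"] = offset        # 0-based, step 25
    return f"{SEARCH_URL}?{urlencode(params)}"
```

For example, `build_search_url("Paris", "2026-05-01", "2026-05-03", nflt=["ht_id=204", "class=4"])` produces a hotels-only, 4-star search URL for the browser to load.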
|
|
252
|
+
|
|
253
|
+
### Property Pages
|
|
254
|
+
`https://www.booking.com/hotel/{country_code}/{hotel_slug}.html`
|
|
255
|
+
|
|
256
|
+
Confirmed from sitemap (74 shards, ~3.5M properties):
|
|
257
|
+
```
|
|
258
|
+
https://www.booking.com/hotel/{cc}/{slug}.html
|
|
259
|
+
https://www.booking.com/hotel/{cc}/{slug}.en-gb.html
|
|
260
|
+
https://www.booking.com/hotel/{cc}/{slug}.{lang}.html
|
|
261
|
+
```
|
|
262
|
+
- `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`)
|
|
263
|
+
- `slug` = hotel name, lowercase, hyphen-separated
|
|
264
|
+
- Locale suffix optional; omit for default (English)
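
Sitemap URLs in this shape can be split back into their parts. This is a minimal sketch, assuming slugs never contain a dot and locale suffixes look like `en-gb` or `de`; `parse_hotel_url` is a hypothetical helper:

```python
import re
from urllib.parse import urlsplit

# Assumption: /hotel/{cc}/{slug}[.{lang}].html with no dots inside the slug
HOTEL_PATH_RE = re.compile(
    r"/hotel/(?P<cc>[a-z]{2})/(?P<slug>[^/.]+)"
    r"(?:\.(?P<lang>[a-z]{2}(?:-[a-z]{2})?))?\.html"
)

def parse_hotel_url(url: str):
    """Return {'cc', 'slug', 'lang'} for a property URL, or None if it does not match."""
    m = HOTEL_PATH_RE.fullmatch(urlsplit(url).path)
    return m.groupdict() if m else None
```

The `lang` value is `None` when the URL has no locale suffix, which makes it easy to deduplicate localized variants of the same property by `(cc, slug)`.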
|
|
265
|
+
|
|
266
|
+
### City / Region / Country Pages
|
|
267
|
+
```
|
|
268
|
+
https://www.booking.com/city/{cc}/{city-slug}.html
|
|
269
|
+
https://www.booking.com/region/{cc}/{region-slug}.html
|
|
270
|
+
https://www.booking.com/country/{cc}.html
|
|
271
|
+
```
|
|
272
|
+
|
|
273
|
+
---
|
|
274
|
+
|
|
275
|
+
## Browser-Based Extraction (Required for All Data)
|
|
276
|
+
|
|
277
|
+
Since `http_get` is blocked, all actual data extraction requires the browser
|
|
278
|
+
(`goto_url` + `js`). The WAF challenge resolves automatically in Chrome.
|
|
279
|
+
|
|
280
|
+
### Initial Navigation
|
|
281
|
+
|
|
282
|
+
```python
|
|
283
|
+
# Always use new_tab() for the first Booking.com load in a session
|
|
284
|
+
tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD")
|
|
285
|
+
wait_for_load()
|
|
286
|
+
wait(3) # React hydration takes ~3s after readyState=complete
|
|
287
|
+
|
|
288
|
+
# Check for WAF challenge still running (rare in real Chrome)
|
|
289
|
+
url = page_info()["url"]
|
|
290
|
+
if "chal_t=" in url:
|
|
291
|
+
    wait(5)  # WAF challenge resolving
|
|
292
|
+
    wait_for_load()
|
|
293
|
+
```
|
|
294
|
+
|
|
295
|
+
### GDPR / Cookie Consent Banner (EU Visitors)
|
|
296
|
+
|
|
297
|
+
Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after**
|
|
298
|
+
the WAF challenge resolves. It blocks interaction until dismissed.
|
|
299
|
+
|
|
300
|
+
```python
|
|
301
|
+
def dismiss_cookie_banner():
|
|
302
|
+
# Booking.com uses data-testid="accept" on the Accept button
|
|
303
|
+
    accepted = js("""
|
|
304
|
+
(function() {
|
|
305
|
+
var btn = document.querySelector('[data-testid="accept"]')
|
|
306
|
+
|| document.querySelector('#onetrust-accept-btn-handler')
|
|
307
|
+
|| document.querySelector('[aria-label*="Accept"]');
|
|
308
|
+
if (btn) { btn.click(); return true; }
|
|
309
|
+
return false;
|
|
310
|
+
})()
|
|
311
|
+
""")
|
|
312
|
+
    return accepted
|
|
313
|
+
|
|
314
|
+
# Call immediately after load if you have an EU IP
|
|
315
|
+
if dismiss_cookie_banner():
|
|
316
|
+
    wait(1)
|
|
317
|
+
```
|
|
318
|
+
|
|
319
|
+
The consent banner does **not** appear in the WAF stub — it only renders after
|
|
320
|
+
the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`)
|
|
321
|
+
may not see it at all.
|
|
322
|
+
|
|
323
|
+
### Search Results Page Extraction
|
|
324
|
+
|
|
325
|
+
```python
|
|
326
|
+
results = js("""
|
|
327
|
+
Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({
|
|
328
|
+
name: el.querySelector('[data-testid="title"]')?.innerText?.trim(),
|
|
329
|
+
url: el.querySelector('[data-testid="title-link"]')?.href,
|
|
330
|
+
price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
|
|
331
|
+
rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(),
|
|
332
|
+
stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length,
|
|
333
|
+
location: el.querySelector('[data-testid="address"]')?.innerText?.trim(),
|
|
334
|
+
availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(),
|
|
335
|
+
is_genius: !!el.querySelector('[data-testid="genius-label"]'),
|
|
336
|
+
}))
|
|
337
|
+
""")
|
|
338
|
+
```
|
|
339
|
+
|
|
340
|
+
**Field notes:**
|
|
341
|
+
- `data-testid="property-card"` — confirmed selector for result cards (as of
|
|
342
|
+
2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes).
|
|
343
|
+
- `data-testid="price-and-discounted-price"` — contains the nightly rate;
|
|
344
|
+
may show original + discounted price together as text.
|
|
345
|
+
- `data-testid="review-score"` — contains both the numeric score (e.g.,
|
|
346
|
+
`"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score.
|
|
347
|
+
- `data-testid="rating-stars"` — star rating icons; count SVG children for
|
|
348
|
+
star count.
|
|
349
|
+
- Results are loaded asynchronously; 3s wait after `wait_for_load()` is
|
|
350
|
+
required for all cards to render.
|
|
351
|
+
|
|
352
|
+
### Pagination
|
|
353
|
+
|
|
354
|
+
```python
|
|
355
|
+
# Method 1: Next page button
|
|
356
|
+
next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href")
|
|
357
|
+
if next_btn:
|
|
358
|
+
    goto_url(next_btn)
|
|
359
|
+
    wait_for_load()
|
|
360
|
+
    wait(3)
|
|
361
|
+
|
|
362
|
+
# Method 2: Offset parameter (25 results per page)
|
|
363
|
+
current_url = page_info()["url"]
|
|
364
|
+
offset = 25 # next page
|
|
365
|
+
goto_url(current_url + f"&offset={offset}")  # assumes current_url has a query string and no existing offset param
|
|
366
|
+
wait_for_load()
|
|
367
|
+
wait(3)
|
|
368
|
+
```
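
Appending `&offset=` to a URL that already carries an `offset` parameter produces a duplicate. A hedged sketch of a safer variant using `urllib.parse` follows; `with_offset` is a hypothetical helper, and it assumes Booking.com treats a re-encoded query string the same as the original:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_offset(url: str, offset: int) -> str:
    """Return url with its offset query parameter replaced (or added)."""
    parts = urlsplit(url)
    # Drop any existing offset, keep every other parameter in order
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "offset"]
    query.append(("offset", str(offset)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

In the pagination loop this would be used as `goto_url(with_offset(page_info()["url"], 50))`, stepping the offset by 25 per page.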
|
|
369
|
+
|
|
370
|
+
### Property / Hotel Page Extraction
|
|
371
|
+
|
|
372
|
+
```python
|
|
373
|
+
detail = js("""
|
|
374
|
+
({
|
|
375
|
+
name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim()
|
|
376
|
+
|| document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(),
|
|
377
|
+
rating: document.querySelector('[data-testid="rating-squares"]')
|
|
378
|
+
? document.querySelectorAll('[data-testid="rating-squares"] svg').length
|
|
379
|
+
: null,
|
|
380
|
+
score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText
|
|
381
|
+
|| document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'),
|
|
382
|
+
address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim()
|
|
383
|
+
|| document.querySelector('[id="hotel_address"]')?.innerText?.trim(),
|
|
384
|
+
description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim()
|
|
385
|
+
|| document.querySelector('#property_description_content')?.innerText?.trim(),
|
|
386
|
+
amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]'))
|
|
387
|
+
.map(e => e.innerText?.trim()).filter(Boolean),
|
|
388
|
+
room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]'))
|
|
389
|
+
.map(el => ({
|
|
390
|
+
name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(),
|
|
391
|
+
price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(),
|
|
392
|
+
})),
|
|
393
|
+
lat: document.querySelector('a[href*="maps.google"]')
|
|
394
|
+
?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0],
|
|
395
|
+
lon: document.querySelector('a[href*="maps.google"]')
|
|
396
|
+
?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1],
|
|
397
|
+
})
|
|
398
|
+
""")
|
|
399
|
+
```

### JSON-LD Schema (Property Pages)

Property pages embed JSON-LD once they are fully rendered in a browser. The
schema type is `Hotel`; the lookup also accepts `LodgingBusiness` as a fallback:

```python
ld_json = js("""
(function() {
  for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
    try {
      var d = JSON.parse(s.textContent);
      if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
    } catch(e) {}
  }
  return null;
})()
""")
# Returns:
# {
#   "@type": "Hotel",
#   "name": "Hotel de Crillon",
#   "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"},
#   "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...},
#   "geo": {"latitude": 48.865, "longitude": 2.321},
#   "starRating": {"ratingValue": 5}
# }
```

JSON-LD is **not present in the WAF stub**; it exists only in the fully
rendered page, so `http_get` will never see it.
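
Because the JSON-LD payload mixes strings and numbers (`"9.2"` versus `5` in the sample above), it can help to flatten it before use. The sketch below is one way to do that, assuming the dict has the shape shown in the sample; `summarize_hotel_ld` and its output keys are hypothetical names chosen for illustration.

```python
def summarize_hotel_ld(ld):
    """Flatten a Hotel JSON-LD dict into plain values; missing fields become None."""
    if not isinstance(ld, dict):
        return None

    def as_dict(value):
        # Nested blocks may be absent or plain text; treat anything else as empty
        return value if isinstance(value, dict) else {}

    def to_float(value):
        try:
            return float(value)
        except (TypeError, ValueError):
            return None

    def to_int(value):
        try:
            return int(value)
        except (TypeError, ValueError):
            return None

    rating = as_dict(ld.get("aggregateRating"))
    address = as_dict(ld.get("address"))
    geo = as_dict(ld.get("geo"))
    stars = as_dict(ld.get("starRating"))
    return {
        "name": ld.get("name"),
        "score": to_float(rating.get("ratingValue")),
        "review_count": to_int(rating.get("reviewCount")),
        "street": address.get("streetAddress"),
        "city": address.get("addressLocality"),
        "lat": to_float(geo.get("latitude")),
        "lon": to_float(geo.get("longitude")),
        "stars": to_float(stars.get("ratingValue")),
    }
```

Calling `summarize_hotel_ld(ld_json)` on a `None` result (for example when only the WAF stub loaded) returns `None` rather than raising.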

### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`)

Booking.com's React app may embed search state in `window.__NEXT_DATA__` or
legacy `b_hotel_data` globals. Access via:

```python
next_data = js("window.__NEXT_DATA__")   # dict or None
b_hotel   = js("window.b_hotel_data")    # dict or None (legacy pages)
```

These globals are not present in the WAF stub, and their availability depends
on page version. Prefer `data-testid` selectors, which are more stable.

---

## Pricing Extraction Patterns

Booking.com shows prices per night with multiple formatting variants:

```python
price_patterns = js("""
({
  // Search results card price
  search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
  // Property page room price
  room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText,
  // Original (crossed-out) price before discount
  original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText
    || document.querySelector('.prco-valign-middle-helper del')?.innerText,
  // "Price for X nights" summary
  total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText,
  // Genius discount tag
  genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText,
})
""")
```

**Price display nuances:**
- Prices shown are **per night** by default; multiply by nights for total.
- Currency is controlled by `selected_currency` URL param or user account
  setting.
- Taxes/fees may or may not be included; look for `"Includes taxes and fees"`
  or `"+ taxes & fees"` text adjacent to the price element.
- The `data-testid="price-and-discounted-price"` element returns a single
  string that may contain both original and discounted price
  (e.g., `"US$400\nUS$320"`).
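
The combined price string can be split into its parts on the Python side. The sketch below is a minimal example, assuming English-locale formatting (comma as thousands separator, dot as decimal point) and a currency prefix before the digits; `parse_price_text` is a hypothetical helper name.

```python
import re

def parse_price_text(text):
    """Split a price string into currency prefix, current amount, and original amount."""
    if not text:
        return None
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return None

    def amount(s):
        # Assumes "1,250.50"-style numbers; "1.250,50" would be misread
        m = re.search(r"\d[\d,]*(?:\.\d+)?", s)
        return float(m.group().replace(",", "")) if m else None

    current = amount(lines[-1])                           # last line is the price charged
    original = amount(lines[0]) if len(lines) > 1 else None
    currency = re.match(r"[^\d]*", lines[-1]).group().strip() or None
    return {"currency": currency, "current": current, "original": original}
```

For a non-discounted price there is only one line, so `original` stays `None` and `current` holds the displayed amount.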

---

## WAF Detection & Handling in Browser

The WAF challenge normally resolves on its own in a real Chrome session. To
detect when it has not:

```python
import time

def check_booking_waf():
    """Return True while the WAF challenge page is still showing."""
    url = page_info()["url"]
    html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or ""
    return (
        "chal_t=" in url
        or "AwsWafIntegration" in html_snippet
        or "challenge-container" in html_snippet
    )

def wait_past_waf(timeout=15):
    """Poll once per second until the challenge clears or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not check_booking_waf():
            return True
        wait(1)
    return False  # timed out: WAF didn't resolve

# Use after goto_url():
goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1")
wait_for_load()
wait_past_waf()  # returns False if the challenge is still showing
wait(2)  # hydration
```

---

## Sitemap-Based URL Discovery Workflow

Use this when you need a list of property URLs for a given country or city
without scraping search results pages in the browser:

```python
import gzip, re, urllib.request

GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}

def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]:
    """Returns property page URLs for a country from sitemaps. No browser needed."""
    idx_url = "https://www.booking.com/sitembk-hotel-index.xml"
    idx = http_get(idx_url, headers=GOOGLEBOT)
    pattern = rf'<loc>(https://www\.booking\.com/sitembk-hotel-{lang}\.\d+\.xml\.gz)</loc>'
    shards = re.findall(pattern, idx)[:max_shards]

    urls = []
    for shard_url in shards:
        req = urllib.request.Request(shard_url, headers=GOOGLEBOT)
        with urllib.request.urlopen(req, timeout=60) as r:
            xml = gzip.decompress(r.read()).decode()
        all_urls = re.findall(r'<loc>(https://[^<]+)</loc>', xml)
        # Filter by country code
        country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u]
        urls.extend(country_urls)
    return urls

# Example: get French hotel URLs (no browser needed)
# french_hotels = get_hotel_urls_for_country("fr", max_shards=1)
# len(french_hotels) -> ~8,000+ URLs from one shard
```
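
The regex extraction above works on well-formed shards. As an alternative, the sketch below parses a decompressed shard with the standard-library XML parser instead. It assumes the shards declare the standard sitemaps.org namespace (not verified against Booking.com's files); `extract_locs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: shards use the standard sitemap protocol namespace
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_locs(xml_text, country_code=None):
    """Return <loc> URLs from a sitemap document, optionally filtered by country path."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iterfind(".//sm:loc", SITEMAP_NS) if el.text]
    if country_code:
        # Same path-based filter as the regex version: /hotel/<cc>/
        locs = [u for u in locs if f"/hotel/{country_code}/" in u]
    return locs
```

A parser also unescapes XML entities such as `&amp;` in URLs, which the regex version leaves as-is. If the namespace assumption is wrong, the function returns an empty list rather than failing.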

---

## Gotchas

- **WAF blocks everything via `http_get`**: there is no User-Agent or header
  combination that bypasses it. The challenge is cryptographic, not heuristic.
- **WAF has two page sizes**: ~3,962 bytes (newer SDK, no AJAX reporter) and
  ~8,410 bytes (older with error reporting). Both are equally blocked.
- **Sitemaps whitelist Googlebot UA**: the `Googlebot/2.1` UA works for sitemap
  XML/GZ files but NOT for hotel/city/search HTML pages.
- **GraphQL endpoint is unprotected** but useless without a valid Booking.com
  session (irene service requires authentication for all substantive queries).
- **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by
  operation name restriction. Use field validation errors to probe the schema.
- **GDPR consent banner**: shown after WAF resolves, before React renders
  search results. Must be dismissed (click `[data-testid="accept"]`) before
  interacting with the page in EU sessions. Non-EU IPs may not see it.
- **React hydration delay**: `wait_for_load()` fires before card data renders.
  Always add 2-3s of `wait()` after `wait_for_load()`.
- **`sr-hotel` class is legacy**: Booking.com migrated to data-testid
  attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`.
- **Price parsing**: the price element often contains the full string
  `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last
  item for current price.
- **Offset pagination cap**: Booking caps results at 1,000 properties per
  search (offset 0–975, rows=25). For cities with >1,000 properties, use
  filters (`nflt`) to segment results.
- **Currency must be set via URL param**: `selected_currency=USD` in the search
  URL; the cookie-based currency selection may not persist across navigation.
- **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`,
  London = `-2601889`. Negative integers indicate city-level destinations.
  Get the ID by reading it from the URL after using `ss=` search.
|