@pencil-agent/nano-pencil 2.0.1 → 2.0.2
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/model/custom-providers.js +1 -1
- package/dist/core/model-registry.js +5 -5
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +19 -19
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -190
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
|
@@ -1,433 +1,433 @@
|
|
|
1
|
-
# Zillow — Scraping & Data Extraction
|
|
2
|
-
|
|
3
|
-
Field-tested against `www.zillow.com` on 2026-04-18 using `http_get` (no browser).
|
|
4
|
-
|
|
5
|
-
## Quick summary
|
|
6
|
-
|
|
7
|
-
- **Search listing pages (`/homes/`, `/sold/`, `/rentals/`)**: `http_get` works with full Chrome headers. Returns ~973 KB of HTML with all listing data embedded in the `__NEXT_DATA__` JSON.
- **Individual property detail pages (`/homedetails/`)**: `http_get` returned **HTTP 403** on every attempt. No header combination tested bypassed this.
- **Internal API endpoints** (`/async-create-search-page-state`, `/graphql/`): **403** for all server-side requests regardless of headers.
- **Redfin**: `http_get` works for both the HTML pages (which embed JSON-LD per listing) and the stingray JSON API.
|
|
11
|
-
|
|
12
|
-
---
|
|
13
|
-
|
|
14
|
-
## What works: search listing pages via `__NEXT_DATA__`
|
|
15
|
-
|
|
16
|
-
Zillow search pages embed all listing data in `<script id="__NEXT_DATA__">`. This is standard Next.js SSR output — it is the same data Zillow's React app hydrates from.
|
|
17
|
-
|
|
18
|
-
**Required headers** — The single-word User-Agent (`"Mozilla/5.0"`) used by `http_get` internally gets 403. You must pass a full Chrome UA plus Accept/Accept-Language headers:
|
|
19
|
-
|
|
20
|
-
```python
import json
import re

from helpers import http_get

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def extract_listings(html):
    """Parse Zillow __NEXT_DATA__ and return list of listing dicts."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if not m:
        return []
    d = json.loads(m.group(1))
    sps = d['props']['pageProps']['searchPageState']
    return sps['cat1']['searchResults']['listResults']

html = http_get("https://www.zillow.com/homes/San-Francisco,-CA_rb/", headers=HEADERS)
listings = extract_listings(html)
print(len(listings))  # 41, always 41 per page
```
|
|
43
|
-
|
|
44
|
-
### Fields available in each listing card
|
|
45
|
-
|
|
46
|
-
The `listResults` array is the canonical source. Each entry includes:
|
|
47
|
-
|
|
48
|
-
| Field | Source | Example or notes |
|---|---|---|
| `zpid` | listing | `15081707` |
| `address` | listing | `"212 Spruce St, San Francisco, CA 94118"` |
| `addressStreet`, `addressCity`, `addressState`, `addressZipcode` | listing | split address components |
| `price` | listing | `"$4,395,000"` (formatted string) |
| `unformattedPrice` | listing | `4395000` (int, use for math) |
| `beds` | listing | `4` |
| `baths` | listing | `4` |
| `area` | listing | `4133` (sqft) |
| `latLong` | listing | `{'latitude': 37.78867, 'longitude': -122.45361}` |
| `statusType` | listing | `"FOR_SALE"` / `"FOR_RENT"` / `"RECENTLY_SOLD"` |
| `detailUrl` | listing | full `https://www.zillow.com/homedetails/...` URL |
| `zestimate` | listing | `4857200` (Zillow AI estimate, int) |
| `imgSrc` | listing | thumbnail URL |
| `has3DModel` | listing | `True`/`False` |
| `hasOpenHouse` | listing | `True`/`False` |
| `openHouseStartDate`, `openHouseEndDate` | listing | ISO strings |
| `isFeaturedListing` | listing | sponsored/featured flag |
| `brokerName` | listing | `"Sotheby's International Realty"` |
| `statusText` | listing | `"FOR SALE"` display string |
| `hdpData.homeInfo.price` | nested | raw price int (matches `unformattedPrice`) |
| `hdpData.homeInfo.zestimate` | nested | raw zestimate int |
| `hdpData.homeInfo.rentZestimate` | nested | monthly rent estimate |
| `hdpData.homeInfo.homeType` | nested | `"SINGLE_FAMILY"`, `"CONDO"`, `"TOWNHOUSE"` etc. |
| `hdpData.homeInfo.daysOnZillow` | nested | int |
| `hdpData.homeInfo.taxAssessedValue` | nested | int |
| `hdpData.homeInfo.lotAreaValue` + `lotAreaUnit` | nested | e.g. `2957.724`, `"sqft"` |
| `hdpData.homeInfo.priceForHDP` | nested | reliable sold price for recently-sold listings |
|
|
77
|
-
|
|
78
|
-
```python
# Full extraction snippet
listing = listings[0]
hi = listing.get('hdpData', {}).get('homeInfo', {})

record = {
    "zpid": listing['zpid'],
    "address": listing['address'],
    "price_raw": listing.get('unformattedPrice') or hi.get('price'),
    "beds": listing.get('beds'),
    "baths": listing.get('baths'),
    "sqft": listing.get('area'),
    "lat": listing['latLong']['latitude'],
    "lon": listing['latLong']['longitude'],
    "status": listing['statusType'],
    "zestimate": listing.get('zestimate'),
    "rent_zest": hi.get('rentZestimate'),
    "home_type": hi.get('homeType'),
    "days_listed": hi.get('daysOnZillow'),
    "tax_assessed": hi.get('taxAssessedValue'),
    "url": listing['detailUrl'],
}
```
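The extraction snippet above indexes `zpid`, `latLong`, and `detailUrl` directly, so a card that lacks one of them raises `KeyError`. A minimal sketch of a more defensive variant follows. The helper names `dig` and `to_record` are hypothetical, and the sample card only reuses the example values from the field table; it is not a captured response.

```python
def dig(obj, *keys, default=None):
    """Walk nested dicts, returning `default` when any key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def to_record(listing):
    """Flatten one listResults entry into a plain dict (hypothetical helper)."""
    price = listing.get("unformattedPrice")
    if price is None:
        # Fall back to the nested raw price when the top-level one is absent
        price = dig(listing, "hdpData", "homeInfo", "price")
    return {
        "zpid": listing.get("zpid"),
        "address": listing.get("address"),
        "price_raw": price,
        "lat": dig(listing, "latLong", "latitude"),
        "lon": dig(listing, "latLong", "longitude"),
        "home_type": dig(listing, "hdpData", "homeInfo", "homeType"),
    }


# Sample card built from the example values in the field table
sample = {
    "zpid": 15081707,
    "address": "212 Spruce St, San Francisco, CA 94118",
    "unformattedPrice": 4395000,
    "latLong": {"latitude": 37.78867, "longitude": -122.45361},
}
record = to_record(sample)
print(record["price_raw"], record["home_type"])  # 4395000 None
```

Missing fields come back as `None` rather than aborting the run, which may be preferable when looping over many pages.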
|
|
101
|
-
|
|
102
|
-
### Total result count and pagination
|
|
103
|
-
|
|
104
|
-
```python
d = json.loads(re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL).group(1))
sps = d['props']['pageProps']['searchPageState']

# Total listings in this search
total = sps['categoryTotals']['cat1']['totalResultCount']
print(total)  # 1037

# Each page returns exactly 41 listings. Add /<N>_p/ for subsequent pages:
# Page 2: https://www.zillow.com/homes/San-Francisco,-CA_rb/2_p/
# Page 3: https://www.zillow.com/homes/San-Francisco,-CA_rb/3_p/

max_pages = (total + 40) // 41  # ceiling division by the page size
```
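The page arithmetic and URL scheme noted in the comments above can be wrapped in two small helpers. This is a sketch only: `PAGE_SIZE`, `page_count`, and `page_url` are hypothetical names, and the page size of 41 is taken from the observation above rather than from any documented guarantee.

```python
PAGE_SIZE = 41  # listings per full search page, as observed in these notes


def page_count(total, page_size=PAGE_SIZE):
    """Ceiling division: number of pages needed to cover `total` results."""
    return (total + page_size - 1) // page_size


def page_url(city_slug, page=1):
    """Build a for-sale search URL; page 1 has no /<N>_p/ suffix."""
    base = f"https://www.zillow.com/homes/{city_slug}_rb/"
    return base if page <= 1 else f"{base}{page}_p/"


print(page_count(1037))  # 26
print(page_url("San-Francisco,-CA", 3))
# https://www.zillow.com/homes/San-Francisco,-CA_rb/3_p/
```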
|
|
118
|
-
|
|
119
|
-
### Scrape all pages
|
|
120
|
-
|
|
121
|
-
```python
import json
import re
import time

from helpers import http_get

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def get_listings(city_slug, page=1):
    """city_slug: e.g. 'San-Francisco,-CA', 'Seattle,-WA', 'Austin,-TX'"""
    if page == 1:
        url = f"https://www.zillow.com/homes/{city_slug}_rb/"
    else:
        url = f"https://www.zillow.com/homes/{city_slug}_rb/{page}_p/"
    html = http_get(url, headers=HEADERS)
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if not m:
        # Fail loudly instead of hitting AttributeError on m.group()
        raise ValueError(f"__NEXT_DATA__ not found in {url}")
    d = json.loads(m.group(1))
    sps = d['props']['pageProps']['searchPageState']
    total = sps['categoryTotals']['cat1']['totalResultCount']
    listings = sps['cat1']['searchResults']['listResults']
    return listings, total

all_listings = []
listings, total = get_listings("San-Francisco,-CA")
all_listings.extend(listings)

max_pages = (total + 40) // 41
for page in range(2, min(max_pages + 1, 6)):  # cap at 5 pages for demo
    time.sleep(1.0)  # polite delay
    page_listings, _ = get_listings("San-Francisco,-CA", page)
    all_listings.extend(page_listings)

print(f"Fetched {len(all_listings)} of {total} listings")
```
|
|
157
|
-
|
|
158
|
-
---
|
|
159
|
-
|
|
160
|
-
## URL patterns that work (all confirmed)
|
|
161
|
-
|
|
162
|
-
| URL pattern | Status | Notes |
|---|---|---|
| `/homes/{city}_rb/` | **Works** | For-sale listings |
| `/homes/{city}_rb/{N}_p/` | **Works** | Pagination |
| `/homes/for_sale/{city}/0-1800000_price/` | **Works** | Price filter (max) |
| `/homes/3-_beds/{city}/` | **Works** | Bed count filter |
| `/homes/{zip}_rb/` | **Works** | ZIP code search |
| `/san-francisco-ca/rentals/` | **Works** | Rental listings |
| `/san-francisco-ca/sold/` | **Works** | Recently sold |
| `/homedetails/{address}/{zpid}_zpid/` | **403** | Single property detail |
| `/async-create-search-page-state` | **403** | Internal search API |
| `/graphql/` | **400/403** | GraphQL endpoint |
|
|
174
|
-
|
|
175
|
-
---
|
|
176
|
-
|
|
177
|
-
## Rental listings
|
|
178
|
-
|
|
179
|
-
Rental search pages use the same `__NEXT_DATA__` structure. However, rental listing cards have a **different schema**. Multi-unit buildings nest per-unit prices under `units` rather than exposing one flat price:
|
|
180
|
-
|
|
181
|
-
```python
html = http_get("https://www.zillow.com/san-francisco-ca/rentals/", headers=HEADERS)
listings = extract_listings(html)

r = listings[0]
# Multi-unit buildings:
# r['units'] = [{'price': '$3,485+', 'beds': '0', 'roomForRent': False}, ...]
# r['minBaseRent'] = 3485
# r['maxBaseRent'] = 7130
# r['availabilityCount'] = 23

# Single-unit rentals:
# r['price'] = '$2,500/mo'
# r['unformattedPrice'] = 2500

# Check which type:
if r.get('isBuilding'):
    price_range = f"${r['minBaseRent']}–${r['maxBaseRent']}/mo"
    units = r.get('units', [])
else:
    price = r.get('unformattedPrice') or r.get('hdpData', {}).get('homeInfo', {}).get('price')
```
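Since the two rental card shapes carry rent in different fields, downstream code may find it easier to work with a single `(low, high)` pair. The sketch below assumes only the fields shown in the comments above; `rent_range` is a hypothetical helper, and the two sample dicts are hand-built from those example values.

```python
def rent_range(card):
    """Return (low, high) monthly rent for either rental card shape."""
    if card.get("isBuilding"):
        # Multi-unit building: the range is given explicitly
        return card.get("minBaseRent"), card.get("maxBaseRent")
    price = card.get("unformattedPrice")
    if price is None:
        price = card.get("hdpData", {}).get("homeInfo", {}).get("price")
    # Single-unit rental: low and high are the same number
    return price, price


building = {"isBuilding": True, "minBaseRent": 3485, "maxBaseRent": 7130}
single = {"price": "$2,500/mo", "unformattedPrice": 2500}
print(rent_range(building))  # (3485, 7130)
print(rent_range(single))    # (2500, 2500)
```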
|
|
203
|
-
|
|
204
|
-
---
|
|
205
|
-
|
|
206
|
-
## Sold listings
|
|
207
|
-
|
|
208
|
-
Sold pages (`/sold/`) work identically. Key difference: `statusType` is `"RECENTLY_SOLD"` and price comes from `hdpData.homeInfo.priceForHDP` (not the `price` field which is `None` in sold cards):
|
|
209
|
-
|
|
210
|
-
```python
html = http_get("https://www.zillow.com/san-francisco-ca/sold/", headers=HEADERS)
listings = extract_listings(html)

for listing in listings:
    hi = listing.get('hdpData', {}).get('homeInfo', {})
    sold_price = hi.get('priceForHDP')  # actual sold price
    zestimate = hi.get('zestimate')
    tax_value = hi.get('taxAssessedValue')
    # Either value can be None, so format defensively
    price_str = f"${sold_price:,}" if sold_price is not None else "None"
    zest_str = f"${zestimate:,}" if zestimate is not None else "None"
    print(listing['address'], price_str, f"zest={zest_str}")
# 999 Green St APT 1702, San Francisco, CA 94133 $3,200,000 zest=$3,403,400
# 1041 Vallejo St, San Francisco, CA 94133 $6,250,000 zest=None
```
|
|
223
|
-
|
|
224
|
-
Total sold inventory in San Francisco: **18,109** (all time in Zillow's database, paginated 41/page).
|
|
225
|
-
|
|
226
|
-
---
|
|
227
|
-
|
|
228
|
-
## Bot detection behavior
|
|
229
|
-
|
|
230
|
-
- **Zillow detects bot status server-side** and embeds `window.__USER_SESSION_INITIAL_STATE__` and `props.isBot` in the page.
- In field testing, the page returned `isBot: False` with the Chrome User-Agent, and **the search pages were not blocked**.
- The page does embed `captcha` strings in the HTML (for the CAPTCHA challenge widget code), but the challenge was NOT triggered for search pages.
- **`/homedetails/` pages do trigger blocking**: every property detail URL tested returned HTTP 403. This is enforced before serving HTML, not via JavaScript CAPTCHA.
- Rate limiting: 3 rapid sequential requests to `/homes/` all succeeded, with no 429s observed. Sleep 0.5 to 1.0 seconds between pages (`time.sleep`) as a courtesy.
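The courtesy delay can be kept in one place with a small pacing loop. This is a sketch under assumptions: `fetch_politely` is a hypothetical helper, the random jitter within the 0.5 to 1.0 second window is an addition not taken from the notes, and `fetch` stands for any callable that takes a URL (for instance a lambda wrapping `http_get` with the headers shown earlier). How `http_get` reports a 403 is not stated here, so errors are simply left to propagate.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=0.5, max_delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause only between requests, not before the first one
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages


# Demo with a stand-in fetcher and a recording sleep, so nothing hits the network
delays = []
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=delays.append,
)
print(len(pages), len(delays))  # 3 2
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.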
|
|
235
|
-
|
|
236
|
-
---
|
|
237
|
-
|
|
238
|
-
## What you do NOT get from `http_get`
|
|
239
|
-
|
|
240
|
-
Because property detail pages are blocked (403), you lose:
|
|
241
|
-
|
|
242
|
-
- Full property description text
- All listing photos (you only get `imgSrc` thumbnail from search)
- Detailed home facts (year built, parking, HVAC, school scores)
- Price history
- Nearby comparable sales (comps)
- Agent contact info
|
|
248
|
-
|
|
249
|
-
**To get these**, you must navigate to the `/homedetails/` URL in a browser session. The browser is not blocked (Zillow relies on JS challenges and fingerprinting that only trigger in browser context).
|
|
250
|
-
|
|
251
|
-
---
|
|
252
|
-
|
|
253
|
-
## Alternative: Redfin (field-tested, more accessible)
|
|
254
|
-
|
|
255
|
-
Redfin allows `http_get` with no blocking for both HTML pages and its internal API.
|
|
256
|
-
|
|
257
|
-
### Redfin JSON-LD per listing (easiest)
|
|
258
|
-
|
|
259
|
-
Each Redfin search results page embeds one `<script type="application/ld+json">` per listing with structured property data:
|
|
260
|
-
|
|
261
|
-
```python
import json
import re

from helpers import http_get

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

html = http_get(
    "https://www.redfin.com/city/17151/CA/San-Francisco/filter/property-type=house",
    headers=HEADERS
)
print(len(html))  # ~1.6 MB

# Extract all residential JSON-LD entries (SingleFamilyResidence and related types)
properties = []
for s in re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL):
    try:
        d = json.loads(s)
    except json.JSONDecodeError:
        continue  # skip malformed blocks
    if isinstance(d, list):
        for item in d:
            if isinstance(item, dict) and item.get('@type') in ('SingleFamilyResidence', 'House', 'Residence', 'Apartment'):
                properties.append(item)

prop = properties[0]
print("Name:", prop['name'])  # "662 Hampshire St, San Francisco, CA 94110"
print("Address:", prop['address'])
# {'@type': 'PostalAddress', 'streetAddress': '662 Hampshire St',
#  'addressLocality': 'San Francisco', 'addressRegion': 'CA',
#  'postalCode': '94110', 'addressCountry': 'US'}
print("Rooms:", prop['numberOfRooms'])  # 3
print("Floor size:", prop['floorSize'])  # {'@type': 'QuantitativeValue', 'value': 3350, 'unitCode': 'FTK'}
print("URL:", prop['url'])
# https://www.redfin.com/CA/San-Francisco/662-Hampshire-St-94110/home/1533754
```
|
|
300
|
-
|
|
301
|
-
Note: The JSON-LD entries do NOT include price (Redfin omits `offers` from the JSON-LD). Use the stingray API below for price.
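Because `address` and `floorSize` are nested objects, a flat record is often easier to store. The following is a minimal sketch: `flatten_jsonld` is a hypothetical helper, and the sample entry is assembled from the printed values shown above rather than fetched live. It treats the `FTK` unit code (the UN/CEFACT code for square feet) as the only unit it understands.

```python
def flatten_jsonld(item):
    """Pull the commonly used fields out of one Redfin JSON-LD entry."""
    address = item.get("address") or {}
    floor = item.get("floorSize") or {}
    return {
        "name": item.get("name"),
        "street": address.get("streetAddress"),
        "city": address.get("addressLocality"),
        "state": address.get("addressRegion"),
        "zip": address.get("postalCode"),
        "rooms": item.get("numberOfRooms"),
        # Only report square footage when the unit code says square feet
        "sqft": floor.get("value") if floor.get("unitCode") == "FTK" else None,
        "url": item.get("url"),
    }


# Sample entry assembled from the example output above
sample_item = {
    "@type": "SingleFamilyResidence",
    "name": "662 Hampshire St, San Francisco, CA 94110",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "662 Hampshire St",
        "addressLocality": "San Francisco",
        "addressRegion": "CA",
        "postalCode": "94110",
        "addressCountry": "US",
    },
    "numberOfRooms": 3,
    "floorSize": {"@type": "QuantitativeValue", "value": 3350, "unitCode": "FTK"},
}
flat = flatten_jsonld(sample_item)
print(flat["zip"], flat["sqft"])  # 94110 3350
```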
|
|
302
|
-
|
|
303
|
-
### Redfin stingray API (structured JSON with price)
|
|
304
|
-
|
|
305
|
-
Redfin's internal GIS/search API returns rich structured data including price, MLS ID, beds, baths, sqft, agent info, and remarks. Responses are prefixed with `{}&&` — strip it before parsing:
|
|
306
|
-
|
|
307
|
-
```python
|
|
308
|
-
import json
|
|
309
|
-
from helpers import http_get
|
|
310
|
-
|
|
311
|
-
def redfin_search(region_id, region_type=6, num_homes=20, page=1, uipt="1,2,3,4,5,6"):
|
|
312
|
-
"""
|
|
313
|
-
region_type: 6=city, 2=zipcode, 5=county
|
|
314
|
-
uipt: property types (1=house, 2=condo, 3=townhouse, 4=multi-family, 5=land, 6=other)
|
|
315
|
-
"""
|
|
316
|
-
url = (
|
|
317
|
-
f"https://www.redfin.com/stingray/api/gis"
|
|
318
|
-
f"?al=1&num_homes={num_homes}&ord=redfin-recommended-asc"
|
|
319
|
-
f"&page_number={page}&region_id={region_id}&region_type={region_type}"
|
|
320
|
-
f"&sf=1,2,3,5,6,7&status=9&uipt={uipt}&v=8"
|
|
321
|
-
)
|
|
322
|
-
raw = http_get(url, headers={
|
|
323
|
-
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
|
324
|
-
"Referer": "https://www.redfin.com/",
|
|
325
|
-
"Accept": "*/*",
|
|
326
|
-
})
|
|
327
|
-
# Strip the {}&& CSRF prefix Redfin prepends to all API responses
|
|
328
|
-
assert raw.startswith('{}&&'), f"Unexpected prefix: {raw[:10]}"
|
|
329
|
-
return json.loads(raw[4:])
|
|
330
|
-
|
|
331
|
-
data = redfin_search(region_id=17151) # 17151 = San Francisco, CA
|
|
332
|
-
homes = data['payload']['homes']
|
|
333
|
-
|
|
334
|
-
home = homes[0]
|
|
335
|
-
print("Address:", home['streetLine']['value']) # "875 California St #703"
|
|
336
|
-
print("City/State/Zip:", home['city'], home['state'], home['zip'])
|
|
337
|
-
print("Price:", home['price']['value']) # 3300000
|
|
338
|
-
print("Beds:", home['beds']) # 3
|
|
339
|
-
print("Baths:", home['baths']) # 2.5
|
|
340
|
-
print("Sqft:", home['sqFt']['value']) # 1828
|
|
341
|
-
print("$/sqft:", home['pricePerSqFt']['value']) # 1805
|
|
342
|
-
print("Lot size:", home['lotSize']['value']) # 9448
|
|
343
|
-
print("Year built:", home['yearBuilt']['value']) # 2021
|
|
344
|
-
print("Days on market:", home['dom']['value']) # 1
|
|
345
|
-
print("MLS ID:", home['mlsId']['value']) # "426115342"
|
|
346
|
-
print("MLS Status:", home['mlsStatus']) # "Active"
|
|
347
|
-
print("Lat/Long:", home['latLong']['value'])
|
|
348
|
-
print("URL:", home['url']) # "/CA/San-Francisco/..."
|
|
349
|
-
print("Remarks:", home['listingRemarks'][:100])
|
|
350
|
-
```
|
|
351
|
-
|
|
352
|
-
### Redfin region IDs
|
|
353
|
-
|
|
354
|
-
| City | region_id | region_type |
|
|
355
|
-
|---|---|---|
|
|
356
|
-
| San Francisco, CA | `17151` | `6` (city) |
|
|
357
|
-
| Los Angeles, CA | `17152` | `6` |
|
|
358
|
-
| New York, NY | `17834` | `6` |
|
|
359
|
-
| Seattle, WA | `16163` | `6` |
|
|
360
|
-
|
|
361
|
-
To find other region IDs: search on Redfin, look at the URL (e.g. `/city/17151/CA/San-Francisco`) — the number is the region_id.
|
|
362
|
-
|
|
363
|
-
### Redfin stingray response structure
|
|
364
|
-
|
|
365
|
-
```
|
|
366
|
-
data['payload']['homes'][i]
|
|
367
|
-
.streetLine.value → street address string
|
|
368
|
-
.city / .state / .zip → strings
|
|
369
|
-
.price.value → int (asking price in dollars)
|
|
370
|
-
.sqFt.value → int (square feet)
|
|
371
|
-
.pricePerSqFt.value → int
|
|
372
|
-
.beds → int
|
|
373
|
-
.baths → float (2.5 = 2 full + 1 half)
|
|
374
|
-
.fullBaths / .partialBaths → ints
|
|
375
|
-
.lotSize.value → int (sq ft)
|
|
376
|
-
.yearBuilt.value → int
|
|
377
|
-
.dom.value → days on market (int)
|
|
378
|
-
.mlsId.value → MLS listing number (string)
|
|
379
|
-
.mlsStatus → "Active", "Pending", etc.
|
|
380
|
-
.listingId → Redfin internal int
|
|
381
|
-
.propertyId → Redfin internal int
|
|
382
|
-
.latLong.value → {'latitude': float, 'longitude': float}
|
|
383
|
-
.url → relative URL "/CA/San-Francisco/..."
|
|
384
|
-
.listingRemarks → description text (may be truncated)
|
|
385
|
-
.keyFacts → [{'description': str, 'rank': int}]
|
|
386
|
-
.listingTags → ['SWEEPING CITY VIEWS', ...]
|
|
387
|
-
.hoa.value → HOA monthly (int)
|
|
388
|
-
.location.value → neighborhood name string
|
|
389
|
-
.sashes → [{'sashTypeName': 'New'/'Price Drop'/...}]
|
|
390
|
-
.photos.value → photo token string
|
|
391
|
-
.numPictures → int
|
|
392
|
-
```
|
|
393
|
-
|
|
394
|
-
---
|
|
395
|
-
|
|
396
|
-
## Alternative APIs (no scraping required)
|
|
397
|
-
|
|
398
|
-
If you need property data without scraping Zillow or Redfin at scale:
|
|
399
|
-
|
|
400
|
-
| API | Free tier | Key data |
|
|
401
|
-
|---|---|---|
|
|
402
|
-
| **ATTOM Data** (attomdata.com) | Trial available | Ownership, AVM, tax, sale history, building characteristics |
|
|
403
|
-
| **Rentcast** (rentcastapi.com) | 50 req/mo free | Rental estimates, comps, market data |
|
|
404
|
-
| **RapidAPI: Zillow56** | ~100 req/mo free | Wraps Zillow data (unofficial, use at own risk) |
|
|
405
|
-
| **HouseCanary** | Paid | AVM, market risk, rental value |
|
|
406
|
-
| **Redfin API** (unofficial, above) | Unlimited | MLS listing data |
|
|
407
|
-
| **US Census / HUD** | Free, no key | Median home values by geography, affordability |
|
|
408
|
-
|
|
409
|
-
---
|
|
410
|
-
|
|
411
|
-
## Gotchas
|
|
412
|
-
|
|
413
|
-
- **Single User-Agent word triggers 403.** `http_get` passes `"Mozilla/5.0"` as default User-Agent — this gets blocked. Always pass the full Chrome UA via the `headers=` argument.
|
|
414
|
-
|
|
415
|
-
- **`price` field is `None` for sold and rental multi-unit listings.** Use `unformattedPrice` for for-sale, `hdpData.homeInfo.priceForHDP` for sold, and `minBaseRent`/`maxBaseRent` for rentals.
|
|
416
|
-
|
|
417
|
-
- **`/homedetails/` is unconditionally blocked.** Tested with full browser headers, Referer, Sec-Fetch-* headers — all return HTTP 403. Only the browser bypasses this.
|
|
418
|
-
|
|
419
|
-
- **41 listings per page, hardcoded.** Zillow always returns exactly 41 results per page from `listResults`. `mapResults` was empty in all tests (server-side response only).
|
|
420
|
-
|
|
421
|
-
- **`isBot: False` doesn't mean you're safe.** Zillow correctly identifies server-side requests and blocks `/homedetails/`. The `isBot` flag in `__NEXT_DATA__` is `False` for search pages but the restriction is enforced at route level for detail pages.
|
|
422
|
-
|
|
423
|
-
- **Captcha strings in HTML do not mean CAPTCHA is active.** The search page includes the captcha widget JavaScript (for lazy loading if needed) but does not serve a challenge — confirmed by successfully parsing listing data from the same HTML.
|
|
424
|
-
|
|
425
|
-
- **Redfin `{}&&` prefix on all API responses.** Strip with `raw[4:]` before `json.loads()`. If the prefix changes, the assertion fails explicitly.
|
|
426
|
-
|
|
427
|
-
- **Redfin JSON-LD omits price.** The `SingleFamilyResidence` schema objects do not include an `offers` field — use the stingray API for pricing.
|
|
428
|
-
|
|
429
|
-
- **Redfin stingray API wraps many listing fields in `{'value': X, 'level': N}` dicts.** Read `['value']` for wrapped fields such as `price`, `sqFt`, `yearBuilt`, and `mlsId` (e.g. `home['price']['value']`, not `home['price']`); `beds`, `baths`, `city`, `state`, `zip`, `mlsStatus`, and `url` are plain values. Level `1` means data is public; `2` means potentially restricted.
|
|
430
|
-
|
|
431
|
-
- **Zillow total count can exceed 800 but pagination caps at page ~20.** Zillow caps search results at around 800 listings even if `totalResultCount` shows 1037. Narrow by ZIP code, neighborhood, or price range to stay within bounds.
|
|
432
|
-
|
|
433
|
-
- **URL filter syntax for Zillow:** Beds: `3-_beds` prefix; price: `0-1800000_price` suffix; ZIP: use `{zip}_rb` instead of city slug. Test by building the URL in a browser and copying the pattern.
|
|
1
|
+
# Zillow — Scraping & Data Extraction
|
|
2
|
+
|
|
3
|
+
Field-tested against `www.zillow.com` on 2026-04-18 using `http_get` (no browser).
|
|
4
|
+
|
|
5
|
+
## Quick summary
|
|
6
|
+
|
|
7
|
+
- **Search listing pages (`/homes/`, `/sold/`, `/rentals/`)** — `http_get` works with full Chrome headers. Returns ~973 KB HTML with all listing data embedded in `__NEXT_DATA__` JSON.
|
|
8
|
+
- **Individual property detail pages (`/homedetails/`)** — `http_get` returns **HTTP 403** unconditionally. No header combination bypasses this.
|
|
9
|
+
- **Internal API endpoints** (`/async-create-search-page-state`, `/graphql/`) — **403** for all server-side requests regardless of headers.
|
|
10
|
+
- **Redfin** — `http_get` works; HTML contains both JSON-LD per listing and a stingray JSON API.
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## What works: search listing pages via `__NEXT_DATA__`
|
|
15
|
+
|
|
16
|
+
Zillow search pages embed all listing data in `<script id="__NEXT_DATA__">`. This is standard Next.js SSR output — it is the same data Zillow's React app hydrates from.
|
|
17
|
+
|
|
18
|
+
**Required headers** — The single-word User-Agent (`"Mozilla/5.0"`) used by `http_get` internally gets 403. You must pass a full Chrome UA plus Accept/Accept-Language headers:
|
|
19
|
+
|
|
20
|
+
```python
|
|
21
|
+
import re, json
|
|
22
|
+
from helpers import http_get
|
|
23
|
+
|
|
24
|
+
HEADERS = {
|
|
25
|
+
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
|
26
|
+
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
|
|
27
|
+
"Accept-Language": "en-US,en;q=0.9",
|
|
28
|
+
}
|
|
29
|
+
|
|
30
|
+
def extract_listings(html):
|
|
31
|
+
"""Parse Zillow __NEXT_DATA__ and return list of listing dicts."""
|
|
32
|
+
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
|
|
33
|
+
if not m:
|
|
34
|
+
return []
|
|
35
|
+
d = json.loads(m.group(1))
|
|
36
|
+
sps = d['props']['pageProps']['searchPageState']
|
|
37
|
+
return sps['cat1']['searchResults']['listResults']
|
|
38
|
+
|
|
39
|
+
html = http_get("https://www.zillow.com/homes/San-Francisco,-CA_rb/", headers=HEADERS)
|
|
40
|
+
listings = extract_listings(html)
|
|
41
|
+
print(len(listings)) # 41 — always 41 per page
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
### Fields available in each listing card
|
|
45
|
+
|
|
46
|
+
The `listResults` array is the canonical source. Each entry includes:
|
|
47
|
+
|
|
48
|
+
| Field | Source | Example |
|
|
49
|
+
|---|---|---|
|
|
50
|
+
| `zpid` | listing | `15081707` |
|
|
51
|
+
| `address` | listing | `"212 Spruce St, San Francisco, CA 94118"` |
|
|
52
|
+
| `addressStreet`, `addressCity`, `addressState`, `addressZipcode` | listing | split address components |
|
|
53
|
+
| `price` | listing | `"$4,395,000"` (formatted string) |
|
|
54
|
+
| `unformattedPrice` | listing | `4395000` (int, use for math) |
|
|
55
|
+
| `beds` | listing | `4` |
|
|
56
|
+
| `baths` | listing | `4` |
|
|
57
|
+
| `area` | listing | `4133` (sqft) |
|
|
58
|
+
| `latLong` | listing | `{'latitude': 37.78867, 'longitude': -122.45361}` |
|
|
59
|
+
| `statusType` | listing | `"FOR_SALE"` / `"FOR_RENT"` / `"RECENTLY_SOLD"` |
|
|
60
|
+
| `detailUrl` | listing | full `https://www.zillow.com/homedetails/...` URL |
|
|
61
|
+
| `zestimate` | listing | `4857200` (Zillow AI estimate, int) |
|
|
62
|
+
| `imgSrc` | listing | thumbnail URL |
|
|
63
|
+
| `has3DModel` | listing | `True`/`False` |
|
|
64
|
+
| `hasOpenHouse` | listing | `True`/`False` |
|
|
65
|
+
| `openHouseStartDate`, `openHouseEndDate` | listing | ISO strings |
|
|
66
|
+
| `isFeaturedListing` | listing | sponsored/featured flag |
|
|
67
|
+
| `brokerName` | listing | `"Sotheby's International Realty"` |
|
|
68
|
+
| `statusText` | listing | `"FOR SALE"` display string |
|
|
69
|
+
| `hdpData.homeInfo.price` | nested | raw price int (matches `unformattedPrice`) |
|
|
70
|
+
| `hdpData.homeInfo.zestimate` | nested | raw zestimate int |
|
|
71
|
+
| `hdpData.homeInfo.rentZestimate` | nested | monthly rent estimate |
|
|
72
|
+
| `hdpData.homeInfo.homeType` | nested | `"SINGLE_FAMILY"`, `"CONDO"`, `"TOWNHOUSE"` etc. |
|
|
73
|
+
| `hdpData.homeInfo.daysOnZillow` | nested | int |
|
|
74
|
+
| `hdpData.homeInfo.taxAssessedValue` | nested | int |
|
|
75
|
+
| `hdpData.homeInfo.lotAreaValue` + `lotAreaUnit` | nested | e.g. `2957.724`, `"sqft"` |
|
|
76
|
+
| `hdpData.homeInfo.priceForHDP` | nested | reliable sold price for recently-sold listings |
|
|
77
|
+
|
|
78
|
+
```python
|
|
79
|
+
# Full extraction snippet
|
|
80
|
+
listing = listings[0]
|
|
81
|
+
hi = listing.get('hdpData', {}).get('homeInfo', {})
|
|
82
|
+
|
|
83
|
+
record = {
|
|
84
|
+
"zpid": listing['zpid'],
|
|
85
|
+
"address": listing['address'],
|
|
86
|
+
"price_raw": listing.get('unformattedPrice') or hi.get('price'),
|
|
87
|
+
"beds": listing.get('beds'),
|
|
88
|
+
"baths": listing.get('baths'),
|
|
89
|
+
"sqft": listing.get('area'),
|
|
90
|
+
"lat": listing['latLong']['latitude'],
|
|
91
|
+
"lon": listing['latLong']['longitude'],
|
|
92
|
+
"status": listing['statusType'],
|
|
93
|
+
"zestimate": listing.get('zestimate'),
|
|
94
|
+
"rent_zest": hi.get('rentZestimate'),
|
|
95
|
+
"home_type": hi.get('homeType'),
|
|
96
|
+
"days_listed": hi.get('daysOnZillow'),
|
|
97
|
+
"tax_assessed": hi.get('taxAssessedValue'),
|
|
98
|
+
"url": listing['detailUrl'],
|
|
99
|
+
}
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
### Total result count and pagination
|
|
103
|
+
|
|
104
|
+
```python
|
|
105
|
+
d = json.loads(re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL).group(1))
|
|
106
|
+
sps = d['props']['pageProps']['searchPageState']
|
|
107
|
+
|
|
108
|
+
# Total listings in this search
|
|
109
|
+
total = sps['categoryTotals']['cat1']['totalResultCount']
|
|
110
|
+
print(total) # 1037
|
|
111
|
+
|
|
112
|
+
# Each page returns exactly 41 listings. Add /<N>_p/ for subsequent pages:
|
|
113
|
+
# Page 2: https://www.zillow.com/homes/San-Francisco,-CA_rb/2_p/
|
|
114
|
+
# Page 3: https://www.zillow.com/homes/San-Francisco,-CA_rb/3_p/
|
|
115
|
+
|
|
116
|
+
max_pages = (total + 40) // 41
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
### Scrape all pages
|
|
120
|
+
|
|
121
|
+
```python
|
|
122
|
+
import re, json, time
|
|
123
|
+
from helpers import http_get
|
|
124
|
+
|
|
125
|
+
HEADERS = {
|
|
126
|
+
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
|
127
|
+
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
|
|
128
|
+
"Accept-Language": "en-US,en;q=0.9",
|
|
129
|
+
}
|
|
130
|
+
|
|
131
|
+
def get_listings(city_slug, page=1):
|
|
132
|
+
"""city_slug: e.g. 'San-Francisco,-CA', 'Seattle,-WA', 'Austin,-TX'"""
|
|
133
|
+
if page == 1:
|
|
134
|
+
url = f"https://www.zillow.com/homes/{city_slug}_rb/"
|
|
135
|
+
else:
|
|
136
|
+
url = f"https://www.zillow.com/homes/{city_slug}_rb/{page}_p/"
|
|
137
|
+
html = http_get(url, headers=HEADERS)
|
|
138
|
+
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
|
|
139
|
+
d = json.loads(m.group(1))
|
|
140
|
+
sps = d['props']['pageProps']['searchPageState']
|
|
141
|
+
total = sps['categoryTotals']['cat1']['totalResultCount']
|
|
142
|
+
listings = sps['cat1']['searchResults']['listResults']
|
|
143
|
+
return listings, total
|
|
144
|
+
|
|
145
|
+
all_listings = []
|
|
146
|
+
listings, total = get_listings("San-Francisco,-CA")
|
|
147
|
+
all_listings.extend(listings)
|
|
148
|
+
|
|
149
|
+
max_pages = (total + 40) // 41
|
|
150
|
+
for page in range(2, min(max_pages + 1, 6)): # cap at 5 pages for demo
|
|
151
|
+
time.sleep(1.0) # polite delay
|
|
152
|
+
page_listings, _ = get_listings("San-Francisco,-CA", page)
|
|
153
|
+
all_listings.extend(page_listings)
|
|
154
|
+
|
|
155
|
+
print(f"Fetched {len(all_listings)} of {total} listings")
|
|
156
|
+
```
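The page arithmetic above can be combined with the pagination cap noted under Gotchas. The following is a minimal sketch, not a tested crawler: `plan_page_urls` is a hypothetical helper, and `page_cap=20` is an assumption taken from the "page ~20" observation rather than a documented Zillow limit.

```python
def plan_page_urls(city_slug, total, per_page=41, page_cap=20):
    """Return the search-page URLs worth requesting for one city slug."""
    # Ceiling division: pages needed to cover `total` listings.
    needed = (total + per_page - 1) // per_page
    # Never plan past the assumed pagination cap, and always include page 1.
    last_page = max(1, min(needed, page_cap))
    base = f"https://www.zillow.com/homes/{city_slug}_rb/"
    # Page 1 has no suffix; later pages append /<N>_p/.
    return [base if n == 1 else f"{base}{n}_p/" for n in range(1, last_page + 1)]


urls = plan_page_urls("San-Francisco,-CA", total=1037)
print(len(urls), urls[-1])
```

If `total` implies more pages than the cap allows, narrowing the search (ZIP code, price range) is likely a better option than requesting pages past the cap.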
|
|
157
|
+
|
|
158
|
+
---
|
|
159
|
+
|
|
160
|
+
## URL patterns that work (all confirmed)
|
|
161
|
+
|
|
162
|
+
| URL pattern | Status | Notes |
|
|
163
|
+
|---|---|---|
|
|
164
|
+
| `/homes/{city}_rb/` | **Works** | For-sale listings |
|
|
165
|
+
| `/homes/{city}_rb/{N}_p/` | **Works** | Pagination |
|
|
166
|
+
| `/homes/for_sale/{city}/0-1800000_price/` | **Works** | Price filter (max) |
|
|
167
|
+
| `/homes/3-_beds/{city}/` | **Works** | Bed count filter |
|
|
168
|
+
| `/homes/{zip}_rb/` | **Works** | ZIP code search |
|
|
169
|
+
| `/san-francisco-ca/rentals/` | **Works** | Rental listings |
|
|
170
|
+
| `/san-francisco-ca/sold/` | **Works** | Recently sold |
|
|
171
|
+
| `/homedetails/{address}/{zpid}_zpid/` | **403** | Single property detail |
|
|
172
|
+
| `/async-create-search-page-state` | **403** | Internal search API |
|
|
173
|
+
| `/graphql/` | **400/403** | GraphQL endpoint |
|
|
174
|
+
|
|
175
|
+
---
|
|
176
|
+
|
|
177
|
+
## Rental listings
|
|
178
|
+
|
|
179
|
+
Rental search pages use the same `__NEXT_DATA__` structure. However, rental listing cards have a **different schema** — individual units are nested, not a flat price:
|
|
180
|
+
|
|
181
|
+
```python
|
|
182
|
+
html = http_get("https://www.zillow.com/san-francisco-ca/rentals/", headers=HEADERS)
|
|
183
|
+
listings = extract_listings(html)
|
|
184
|
+
|
|
185
|
+
r = listings[0]
|
|
186
|
+
# Multi-unit buildings:
|
|
187
|
+
# r['units'] = [{'price': '$3,485+', 'beds': '0', 'roomForRent': False}, ...]
|
|
188
|
+
# r['minBaseRent'] = 3485
|
|
189
|
+
# r['maxBaseRent'] = 7130
|
|
190
|
+
# r['availabilityCount'] = 23
|
|
191
|
+
|
|
192
|
+
# Single-unit rentals:
|
|
193
|
+
# r['price'] = '$2,500/mo'
|
|
194
|
+
# r['unformattedPrice'] = 2500
|
|
195
|
+
|
|
196
|
+
# Check which type:
|
|
197
|
+
if r.get('isBuilding'):
|
|
198
|
+
price_range = f"${r['minBaseRent']}–${r['maxBaseRent']}/mo"
|
|
199
|
+
units = r.get('units', [])
|
|
200
|
+
else:
|
|
201
|
+
price = r.get('unformattedPrice') or r.get('hdpData', {}).get('homeInfo', {}).get('price')
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
---
|
|
205
|
+
|
|
206
|
+
## Sold listings
|
|
207
|
+
|
|
208
|
+
Sold pages (`/sold/`) work identically. Key difference: `statusType` is `"RECENTLY_SOLD"` and price comes from `hdpData.homeInfo.priceForHDP` (not the `price` field which is `None` in sold cards):
|
|
209
|
+
|
|
210
|
+
```python
|
|
211
|
+
html = http_get("https://www.zillow.com/san-francisco-ca/sold/", headers=HEADERS)
|
|
212
|
+
listings = extract_listings(html)
|
|
213
|
+
|
|
214
|
+
for l in listings:
|
|
215
|
+
hi = l.get('hdpData', {}).get('homeInfo', {})
|
|
216
|
+
sold_price = hi.get('priceForHDP') # actual sold price
|
|
217
|
+
zestimate = hi.get('zestimate')
|
|
218
|
+
tax_value = hi.get('taxAssessedValue')
|
|
219
|
+
print(l['address'], f"${sold_price:,}", f"zest=${zestimate:,}" if zestimate is not None else "zest=None")
|
|
220
|
+
# 999 Green St APT 1702, San Francisco, CA 94133 $3,200,000 zest=$3,403,400
|
|
221
|
+
# 1041 Vallejo St, San Francisco, CA 94133 $6,250,000 zest=None
|
|
222
|
+
```
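Because the price lives in a different field for for-sale, sold, and multi-unit rental cards, a single resolver can keep downstream code uniform. This is a sketch under the assumption that the field names described in this document (`isBuilding`, `minBaseRent`, `maxBaseRent`, `statusType`, `unformattedPrice`, `hdpData.homeInfo.priceForHDP`) are present as described; `resolve_price` is a hypothetical helper and the sample dicts are illustrative, not captured responses.

```python
def resolve_price(listing):
    """Return (low, high) in dollars for a listResults entry, or (None, None)."""
    home_info = (listing.get("hdpData") or {}).get("homeInfo") or {}
    # Multi-unit rental buildings expose a rent range instead of one price.
    if listing.get("isBuilding"):
        return listing.get("minBaseRent"), listing.get("maxBaseRent")
    # Sold cards: the sold price is under hdpData.homeInfo.priceForHDP.
    if listing.get("statusType") == "RECENTLY_SOLD":
        price = home_info.get("priceForHDP")
    else:
        # For-sale and single-unit rentals: prefer the raw integer price.
        price = listing.get("unformattedPrice") or home_info.get("price")
    return price, price


sold = {"statusType": "RECENTLY_SOLD", "price": None,
        "hdpData": {"homeInfo": {"priceForHDP": 3200000}}}
building = {"isBuilding": True, "minBaseRent": 3485, "maxBaseRent": 7130}
print(resolve_price(sold), resolve_price(building))
```

Returning a `(low, high)` pair for every card avoids special-casing buildings later; single-price listings simply repeat the value.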
|
|
223
|
+
|
|
224
|
+
Total sold inventory in San Francisco: **18,109** (all time in Zillow's database, paginated 41/page).
|
|
225
|
+
|
|
226
|
+
---
|
|
227
|
+
|
|
228
|
+
## Bot detection behavior
|
|
229
|
+
|
|
230
|
+
- **Zillow detects bot status server-side** and embeds `window.__USER_SESSION_INITIAL_STATE__` and `props.isBot` in the page.
|
|
231
|
+
- In field testing, the page returned `isBot: False` with the Chrome User-Agent — **Zillow does not block the search pages**.
|
|
232
|
+
- The page does embed `captcha` strings in the HTML (for the CAPTCHA challenge widget code), but the challenge is NOT triggered for search pages.
|
|
233
|
+
- **`/homedetails/` pages do trigger blocking** — every property detail URL tested returned HTTP 403. This is enforced before serving HTML, not via JavaScript CAPTCHA.
|
|
234
|
+
- Rate limiting: 3 rapid sequential requests to `/homes/` all succeeded, and no 429s were observed. Add a `time.sleep()` of 0.5 to 1.0 seconds between pages as a courtesy.
|
|
235
|
+
|
|
236
|
+
---
|
|
237
|
+
|
|
238
|
+
## What you do NOT get from `http_get`
|
|
239
|
+
|
|
240
|
+
Because property detail pages are blocked (403), you lose:
|
|
241
|
+
|
|
242
|
+
- Full property description text
|
|
243
|
+
- All listing photos (you only get `imgSrc` thumbnail from search)
|
|
244
|
+
- Detailed home facts (year built, parking, HVAC, school scores)
|
|
245
|
+
- Price history
|
|
246
|
+
- Nearby comparable sales (comps)
|
|
247
|
+
- Agent contact info
|
|
248
|
+
|
|
249
|
+
**To get these**, you must navigate to the `/homedetails/` URL in a browser session. The browser is not blocked (Zillow relies on JS challenges and fingerprinting that only trigger in browser context).
|
|
250
|
+
|
|
251
|
+
---
|
|
252
|
+
|
|
253
|
+
## Alternative: Redfin (field-tested, more accessible)
|
|
254
|
+
|
|
255
|
+
Redfin allows `http_get` with no blocking for both HTML pages and its internal API.
|
|
256
|
+
|
|
257
|
+
### Redfin JSON-LD per listing (easiest)
|
|
258
|
+
|
|
259
|
+
Each Redfin search results page embeds one `<script type="application/ld+json">` per listing with structured property data:
|
|
260
|
+
|
|
261
|
+
```python
|
|
262
|
+
import re, json
|
|
263
|
+
from helpers import http_get
|
|
264
|
+
|
|
265
|
+
HEADERS = {
|
|
266
|
+
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
|
267
|
+
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
|
|
268
|
+
"Accept-Language": "en-US,en;q=0.9",
|
|
269
|
+
}
|
|
270
|
+
|
|
271
|
+
html = http_get(
|
|
272
|
+
"https://www.redfin.com/city/17151/CA/San-Francisco/filter/property-type=house",
|
|
273
|
+
headers=HEADERS
|
|
274
|
+
)
|
|
275
|
+
print(len(html)) # ~1.6 MB
|
|
276
|
+
|
|
277
|
+
# Extract all SingleFamilyResidence JSON-LD entries
|
|
278
|
+
properties = []
|
|
279
|
+
for s in re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL):
|
|
280
|
+
try:
|
|
281
|
+
d = json.loads(s)
|
|
282
|
+
if isinstance(d, list):
|
|
283
|
+
for item in d:
|
|
284
|
+
if item.get('@type') in ('SingleFamilyResidence', 'House', 'Residence', 'Apartment'):
|
|
285
|
+
properties.append(item)
|
|
286
|
+
except Exception:
|
|
287
|
+
pass
|
|
288
|
+
|
|
289
|
+
prop = properties[0]
|
|
290
|
+
print("Name:", prop['name']) # "662 Hampshire St, San Francisco, CA 94110"
|
|
291
|
+
print("Address:", prop['address'])
|
|
292
|
+
# {'@type': 'PostalAddress', 'streetAddress': '662 Hampshire St',
|
|
293
|
+
# 'addressLocality': 'San Francisco', 'addressRegion': 'CA',
|
|
294
|
+
# 'postalCode': '94110', 'addressCountry': 'US'}
|
|
295
|
+
print("Rooms:", prop['numberOfRooms']) # 3
|
|
296
|
+
print("Floor size:", prop['floorSize']) # {'@type': 'QuantitativeValue', 'value': 3350, 'unitCode': 'FTK'}
|
|
297
|
+
print("URL:", prop['url'])
|
|
298
|
+
# https://www.redfin.com/CA/San-Francisco/662-Hampshire-St-94110/home/1533754
|
|
299
|
+
```
|
|
300
|
+
|
|
301
|
+
Note: The JSON-LD schema does NOT include price (Redfin omits `offers` from the LD+JSON). Use the stingray API below for price.
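The extraction loop above only handles blocks whose top level is a JSON array, and it swallows every exception. JSON-LD in general also allows a single top-level object and an `@type` given as a list; whether Redfin ever emits those shapes was not tested here, so the following is a defensive sketch. `iter_residences` and the sample HTML are hypothetical.

```python
import json
import re

LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>', re.DOTALL
)
RESIDENCE_TYPES = {"SingleFamilyResidence", "House", "Residence", "Apartment"}


def iter_residences(html):
    """Yield JSON-LD objects whose @type names a residence."""
    for raw in LD_JSON_RE.findall(html):
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks only, not every possible error
        # Normalise a single top-level object to a one-element list.
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if RESIDENCE_TYPES.intersection(t for t in types if isinstance(t, str)):
                yield item


sample = (
    '<script type="application/ld+json">'
    '[{"@type": "SingleFamilyResidence", "name": "A"}, {"@type": "Product"}]'
    "</script>"
    '<script type="application/ld+json">{"@type": ["House", "Product"], "name": "B"}</script>'
)
print([p["name"] for p in iter_residences(sample)])
```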
|
|
302
|
+
|
|
303
|
+
### Redfin stingray API (structured JSON with price)
|
|
304
|
+
|
|
305
|
+
Redfin's internal GIS/search API returns rich structured data including price, MLS ID, beds, baths, sqft, agent info, and remarks. Responses are prefixed with `{}&&` — strip it before parsing:
|
|
306
|
+
|
|
307
|
+
```python
|
|
308
|
+
import json
|
|
309
|
+
from helpers import http_get
|
|
310
|
+
|
|
311
|
+
def redfin_search(region_id, region_type=6, num_homes=20, page=1, uipt="1,2,3,4,5,6"):
|
|
312
|
+
"""
|
|
313
|
+
region_type: 6=city, 2=zipcode, 5=county
|
|
314
|
+
uipt: property types (1=house, 2=condo, 3=townhouse, 4=multi-family, 5=land, 6=other)
|
|
315
|
+
"""
|
|
316
|
+
url = (
|
|
317
|
+
f"https://www.redfin.com/stingray/api/gis"
|
|
318
|
+
f"?al=1&num_homes={num_homes}&ord=redfin-recommended-asc"
|
|
319
|
+
f"&page_number={page}&region_id={region_id}&region_type={region_type}"
|
|
320
|
+
f"&sf=1,2,3,5,6,7&status=9&uipt={uipt}&v=8"
|
|
321
|
+
)
|
|
322
|
+
raw = http_get(url, headers={
|
|
323
|
+
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
|
324
|
+
"Referer": "https://www.redfin.com/",
|
|
325
|
+
"Accept": "*/*",
|
|
326
|
+
})
|
|
327
|
+
# Strip the {}&& CSRF prefix Redfin prepends to all API responses
|
|
328
|
+
assert raw.startswith('{}&&'), f"Unexpected prefix: {raw[:10]}"
|
|
329
|
+
return json.loads(raw[4:])
|
|
330
|
+
|
|
331
|
+
data = redfin_search(region_id=17151) # 17151 = San Francisco, CA
|
|
332
|
+
homes = data['payload']['homes']
|
|
333
|
+
|
|
334
|
+
home = homes[0]
|
|
335
|
+
print("Address:", home['streetLine']['value']) # "875 California St #703"
|
|
336
|
+
print("City/State/Zip:", home['city'], home['state'], home['zip'])
|
|
337
|
+
print("Price:", home['price']['value']) # 3300000
|
|
338
|
+
print("Beds:", home['beds']) # 3
|
|
339
|
+
print("Baths:", home['baths']) # 2.5
|
|
340
|
+
print("Sqft:", home['sqFt']['value']) # 1828
|
|
341
|
+
print("$/sqft:", home['pricePerSqFt']['value']) # 1805
|
|
342
|
+
print("Lot size:", home['lotSize']['value']) # 9448
|
|
343
|
+
print("Year built:", home['yearBuilt']['value']) # 2021
|
|
344
|
+
print("Days on market:", home['dom']['value']) # 1
|
|
345
|
+
print("MLS ID:", home['mlsId']['value']) # "426115342"
|
|
346
|
+
print("MLS Status:", home['mlsStatus']) # "Active"
|
|
347
|
+
print("Lat/Long:", home['latLong']['value'])
|
|
348
|
+
print("URL:", home['url']) # "/CA/San-Francisco/..."
|
|
349
|
+
print("Remarks:", home['listingRemarks'][:100])
|
|
350
|
+
```
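The `assert` in `redfin_search` is skipped when Python runs with `-O`, and it gives little context when the endpoint returns an HTML error page instead. A slightly more defensive parse might look like the sketch below; `parse_stingray` is a hypothetical helper, and accepting a response without the prefix is an assumption, not observed behaviour.

```python
import json

REDFIN_PREFIX = "{}&&"


def parse_stingray(raw):
    """Parse a stingray response body, tolerating a missing prefix."""
    body = raw.lstrip()
    # Drop the documented prefix when present.
    if body.startswith(REDFIN_PREFIX):
        body = body[len(REDFIN_PREFIX):]
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Surface the start of the body so a block page is easy to recognise.
        raise ValueError(f"Unexpected stingray response: {raw[:40]!r}") from exc


print(parse_stingray('{}&&{"payload": {"homes": []}}'))
```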
|
|
351
|
+
|
|
352
|
+
### Redfin region IDs
|
|
353
|
+
|
|
354
|
+
| City | region_id | region_type |
|
|
355
|
+
|---|---|---|
|
|
356
|
+
| San Francisco, CA | `17151` | `6` (city) |
|
|
357
|
+
| Los Angeles, CA | `17152` | `6` |
|
|
358
|
+
| New York, NY | `17834` | `6` |
|
|
359
|
+
| Seattle, WA | `16163` | `6` |
|
|
360
|
+
|
|
361
|
+
To find other region IDs: search on Redfin, look at the URL (e.g. `/city/17151/CA/San-Francisco`) — the number is the region_id.
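The lookup described above can be done in code once you have a city URL. This sketch only covers the `/city/<id>/...` shape shown in this document; `region_id_from_url` is a hypothetical helper, and other Redfin URL shapes (ZIP code, county) are not assumed to follow the same pattern.

```python
import re

CITY_URL_RE = re.compile(r"/city/(\d+)/")


def region_id_from_url(url):
    """Return the numeric region_id from a Redfin city URL, or None."""
    match = CITY_URL_RE.search(url)
    return int(match.group(1)) if match else None


print(region_id_from_url("https://www.redfin.com/city/17151/CA/San-Francisco"))
```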
|
|
362
|
+
|
|
363
|
+
### Redfin stingray response structure
|
|
364
|
+
|
|
365
|
+
```
|
|
366
|
+
data['payload']['homes'][i]
|
|
367
|
+
.streetLine.value → street address string
|
|
368
|
+
.city / .state / .zip → strings
|
|
369
|
+
.price.value → int (asking price in dollars)
|
|
370
|
+
.sqFt.value → int (square feet)
|
|
371
|
+
.pricePerSqFt.value → int
|
|
372
|
+
.beds → int
|
|
373
|
+
.baths → float (2.5 = 2 full + 1 half)
|
|
374
|
+
.fullBaths / .partialBaths → ints
|
|
375
|
+
.lotSize.value → int (sq ft)
|
|
376
|
+
.yearBuilt.value → int
|
|
377
|
+
.dom.value → days on market (int)
|
|
378
|
+
.mlsId.value → MLS listing number (string)
|
|
379
|
+
.mlsStatus → "Active", "Pending", etc.
|
|
380
|
+
.listingId → Redfin internal int
|
|
381
|
+
.propertyId → Redfin internal int
|
|
382
|
+
.latLong.value → {'latitude': float, 'longitude': float}
|
|
383
|
+
.url → relative URL "/CA/San-Francisco/..."
|
|
384
|
+
.listingRemarks → description text (may be truncated)
|
|
385
|
+
.keyFacts → [{'description': str, 'rank': int}]
|
|
386
|
+
.listingTags → ['SWEEPING CITY VIEWS', ...]
|
|
387
|
+
.hoa.value → HOA monthly (int)
|
|
388
|
+
.location.value → neighborhood name string
|
|
389
|
+
.sashes → [{'sashTypeName': 'New'/'Price Drop'/...}]
|
|
390
|
+
.photos.value → photo token string
|
|
391
|
+
.numPictures → int
|
|
392
|
+
```
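Since some fields in the structure above are wrapped in `{'value': ..., 'level': ...}` dicts and others are plain, a small unwrapping helper avoids `KeyError` when a wrapped field arrives without a `value`. This is a sketch: `unwrap` and `flatten_home` are hypothetical names, the key selection is illustrative, and the sample record is hand-written rather than a captured response.

```python
def unwrap(field, default=None):
    """Return the inner value of a wrapped field, or the field itself."""
    # Wrapped fields look like {'value': X, 'level': N}; 'value' may be absent.
    if isinstance(field, dict) and ("value" in field or "level" in field):
        return field.get("value", default)
    return default if field is None else field


def flatten_home(home):
    """Project one payload['homes'] entry onto a flat dict of plain values."""
    keys = ("streetLine", "price", "beds", "baths", "sqFt",
            "yearBuilt", "mlsId", "mlsStatus")
    return {key: unwrap(home.get(key)) for key in keys}


home = {
    "streetLine": {"value": "875 California St #703", "level": 1},
    "price": {"value": 3300000, "level": 1},
    "beds": 3,
    "baths": 2.5,
    "sqFt": {"level": 2},  # wrapped but without a value
    "mlsStatus": "Active",
}
print(flatten_home(home))
```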
|
|
393
|
+
|
|
394
|
+
---
|
|
395
|
+
|
|
396
|
+
## Alternative APIs (no scraping required)
|
|
397
|
+
|
|
398
|
+
If you need property data without scraping Zillow or Redfin at scale:
|
|
399
|
+
|
|
400
|
+
| API | Free tier | Key data |
|
|
401
|
+
|---|---|---|
|
|
402
|
+
| **ATTOM Data** (attomdata.com) | Trial available | Ownership, AVM, tax, sale history, building characteristics |
|
|
403
|
+
| **Rentcast** (rentcastapi.com) | 50 req/mo free | Rental estimates, comps, market data |
|
|
404
|
+
| **RapidAPI: Zillow56** | ~100 req/mo free | Wraps Zillow data (unofficial, use at own risk) |
|
|
405
|
+
| **HouseCanary** | Paid | AVM, market risk, rental value |
|
|
406
|
+
| **Redfin API** (unofficial, above) | Unlimited | MLS listing data |
|
|
407
|
+
| **US Census / HUD** | Free, no key | Median home values by geography, affordability |
|
|
408
|
+
|
|
409
|
+
---
|
|
410
|
+
|
|
411
|
+
## Gotchas
|
|
412
|
+
|
|
413
|
+
- **Single User-Agent word triggers 403.** `http_get` passes `"Mozilla/5.0"` as default User-Agent — this gets blocked. Always pass the full Chrome UA via the `headers=` argument.
|
|
414
|
+
|
|
415
|
+
- **`price` field is `None` for sold and rental multi-unit listings.** Use `unformattedPrice` for for-sale, `hdpData.homeInfo.priceForHDP` for sold, and `minBaseRent`/`maxBaseRent` for rentals.
|
|
416
|
+
|
|
417
|
+
- **`/homedetails/` is unconditionally blocked.** Tested with full browser headers, Referer, Sec-Fetch-* headers — all return HTTP 403. Only the browser bypasses this.
|
|
418
|
+
|
|
419
|
+
- **41 listings per page, hardcoded.** Zillow always returns exactly 41 results per page from `listResults`. `mapResults` was empty in all tests (server-side response only).
|
|
420
|
+
|
|
421
|
+
- **`isBot: False` doesn't mean you're safe.** Zillow correctly identifies server-side requests and blocks `/homedetails/`. The `isBot` flag in `__NEXT_DATA__` is `False` for search pages but the restriction is enforced at route level for detail pages.
|
|
422
|
+
|
|
423
|
+
- **Captcha strings in HTML do not mean CAPTCHA is active.** The search page includes the captcha widget JavaScript (for lazy loading if needed) but does not serve a challenge — confirmed by successfully parsing listing data from the same HTML.
|
|
424
|
+
|
|
425
|
+
- **Redfin `{}&&` prefix on all API responses.** Strip with `raw[4:]` before `json.loads()`. If the prefix changes, the assertion fails explicitly.
|
|
426
|
+
|
|
427
|
+
- **Redfin JSON-LD omits price.** The `SingleFamilyResidence` schema objects do not include an `offers` field — use the stingray API for pricing.
|
|
428
|
+
|
|
429
|
+
- **Redfin stingray API wraps many listing fields in `{'value': X, 'level': N}` dicts.** Read `['value']` for wrapped fields such as `price`, `sqFt`, `yearBuilt`, and `mlsId` (e.g. `home['price']['value']`, not `home['price']`); `beds`, `baths`, `city`, `state`, `zip`, `mlsStatus`, and `url` are plain values. Level `1` means data is public; `2` means potentially restricted.
|
|
430
|
+
|
|
431
|
+
- **Zillow total count can exceed 800 but pagination caps at page ~20.** Zillow caps search results at around 800 listings even if `totalResultCount` shows 1037. Narrow by ZIP code, neighborhood, or price range to stay within bounds.
|
|
432
|
+
|
|
433
|
+
- **URL filter syntax for Zillow:** Beds: `3-_beds` prefix; price: `0-1800000_price` suffix; ZIP: use `{zip}_rb` instead of city slug. Test by building the URL in a browser and copying the pattern.
|