@pencil-agent/nano-pencil 2.0.1 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (186)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/grub/README.md +112 -112
  129. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  130. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  131. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  132. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  133. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  134. package/dist/extensions/builtin/loop/README.md +92 -92
  135. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  136. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  137. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  138. package/dist/extensions/builtin/sal/README.md +72 -72
  139. package/dist/extensions/builtin/security-audit/README.md +289 -289
  140. package/dist/extensions/builtin/team/AGENT.md +112 -112
  141. package/dist/extensions/builtin/team/TESTING.md +299 -299
  142. package/dist/extensions/builtin/token-save/README.md +56 -56
  143. package/dist/extensions/optional/AGENT.md +10 -10
  144. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  145. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  146. package/dist/modes/interactive/interactive-mode.js +19 -19
  147. package/dist/modes/interactive/theme/dark.json +85 -85
  148. package/dist/modes/interactive/theme/light.json +84 -84
  149. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  150. package/dist/modes/interactive/theme/warm.json +81 -81
  151. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  152. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  153. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  154. package/docs/SDK-TESTING.md +364 -0
  155. package/docs/codex-goal-command-impl.md +1055 -1055
  156. package/docs/codex-goal-vs-grub.md +500 -500
  157. package/docs/custom-provider.md +27 -27
  158. package/docs/extensions.md +27 -27
  159. package/docs/keybindings.md +27 -27
  160. package/docs/loop 重构完成总结.md +250 -250
  161. package/docs/loop 重构完成报告.md +122 -122
  162. package/docs/loop 重构方案.md +1222 -1222
  163. package/docs/loop 重构方案实现报告.md +158 -158
  164. package/docs/loop 重构方案对比分析.md +128 -128
  165. package/docs/loop 重构计划.md +320 -320
  166. package/docs/loop-usage-examples.md +214 -214
  167. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  168. package/docs/models.md +27 -27
  169. package/docs/packages.md +27 -27
  170. package/docs/pi-design-philosophy.md +457 -457
  171. package/docs/planmode.md +1987 -1987
  172. package/docs/prompt-templates.md +27 -27
  173. package/docs/providers.md +27 -27
  174. package/docs/sdk.md +27 -27
  175. package/docs/skills.md +27 -27
  176. package/docs/startup-performance-optimization.md +301 -0
  177. package/docs/themes.md +27 -27
  178. package/docs/tui.md +27 -27
  179. package/docs/认知地图.md +47 -0
  180. package/package.json +190 -190
  181. package/docs/cc-agent-design.md +0 -1297
  182. package/docs/cc-tui-design.md +0 -1333
  183. package/docs/nanoPencil-学习计划.md +0 -170
  184. package/docs/scan-report.md +0 -3820
  185. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  186. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,433 +1,433 @@
1
- # Zillow — Scraping & Data Extraction
2
-
3
- Field-tested against `www.zillow.com` on 2026-04-18 using `http_get` (no browser).
4
-
5
- ## Quick summary
6
-
7
- - **Search listing pages (`/homes/`, `/sold/`, `/rentals/`)** — `http_get` works with full Chrome headers. Returns ~973 KB HTML with all listing data embedded in `__NEXT_DATA__` JSON.
8
- - **Individual property detail pages (`/homedetails/`)** — `http_get` returns **HTTP 403** unconditionally. No header combination bypasses this.
9
- - **Internal API endpoints** (`/async-create-search-page-state`, `/graphql/`) — **403** for all server-side requests regardless of headers.
10
- - **Redfin** — `http_get` works; HTML contains both JSON-LD per listing and a stingray JSON API.
11
-
12
- ---
13
-
14
- ## What works: search listing pages via `__NEXT_DATA__`
15
-
16
- Zillow search pages embed all listing data in `<script id="__NEXT_DATA__">`. This is standard Next.js SSR output — it is the same data Zillow's React app hydrates from.
17
-
18
- **Required headers** — The single-word User-Agent (`"Mozilla/5.0"`) used by `http_get` internally gets 403. You must pass a full Chrome UA plus Accept/Accept-Language headers:
19
-
20
- ```python
21
- import re, json
22
- from helpers import http_get
23
-
24
- HEADERS = {
25
-     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
26
-     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
27
-     "Accept-Language": "en-US,en;q=0.9",
28
- }
29
-
30
- def extract_listings(html):
31
-     """Parse Zillow __NEXT_DATA__ and return list of listing dicts."""
32
-     m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
33
-     if not m:
34
-         return []
35
-     d = json.loads(m.group(1))
36
-     sps = d['props']['pageProps']['searchPageState']
37
-     return sps['cat1']['searchResults']['listResults']
38
-
39
- html = http_get("https://www.zillow.com/homes/San-Francisco,-CA_rb/", headers=HEADERS)
40
- listings = extract_listings(html)
41
- print(len(listings)) # 41 — always 41 per page
42
- ```
43
-
44
- ### Fields available in each listing card
45
-
46
- The `listResults` array is the canonical source. Each entry includes:
47
-
48
- | Field | Source | Example |
49
- |---|---|---|
50
- | `zpid` | listing | `15081707` |
51
- | `address` | listing | `"212 Spruce St, San Francisco, CA 94118"` |
52
- | `addressStreet`, `addressCity`, `addressState`, `addressZipcode` | listing | split address components |
53
- | `price` | listing | `"$4,395,000"` (formatted string) |
54
- | `unformattedPrice` | listing | `4395000` (int, use for math) |
55
- | `beds` | listing | `4` |
56
- | `baths` | listing | `4` |
57
- | `area` | listing | `4133` (sqft) |
58
- | `latLong` | listing | `{'latitude': 37.78867, 'longitude': -122.45361}` |
59
- | `statusType` | listing | `"FOR_SALE"` / `"FOR_RENT"` / `"RECENTLY_SOLD"` |
60
- | `detailUrl` | listing | full `https://www.zillow.com/homedetails/...` URL |
61
- | `zestimate` | listing | `4857200` (Zillow AI estimate, int) |
62
- | `imgSrc` | listing | thumbnail URL |
63
- | `has3DModel` | listing | `True`/`False` |
64
- | `hasOpenHouse` | listing | `True`/`False` |
65
- | `openHouseStartDate`, `openHouseEndDate` | listing | ISO strings |
66
- | `isFeaturedListing` | listing | sponsored/featured flag |
67
- | `brokerName` | listing | `"Sotheby's International Realty"` |
68
- | `statusText` | listing | `"FOR SALE"` display string |
69
- | `hdpData.homeInfo.price` | nested | raw price int (matches `unformattedPrice`) |
70
- | `hdpData.homeInfo.zestimate` | nested | raw zestimate int |
71
- | `hdpData.homeInfo.rentZestimate` | nested | monthly rent estimate |
72
- | `hdpData.homeInfo.homeType` | nested | `"SINGLE_FAMILY"`, `"CONDO"`, `"TOWNHOUSE"` etc. |
73
- | `hdpData.homeInfo.daysOnZillow` | nested | int |
74
- | `hdpData.homeInfo.taxAssessedValue` | nested | int |
75
- | `hdpData.homeInfo.lotAreaValue` + `lotAreaUnit` | nested | e.g. `2957.724`, `"sqft"` |
76
- | `hdpData.homeInfo.priceForHDP` | nested | reliable sold price for recently-sold listings |
77
-
78
- ```python
79
- # Full extraction snippet
80
- listing = listings[0]
81
- hi = listing.get('hdpData', {}).get('homeInfo', {})
82
-
83
- record = {
84
-     "zpid": listing['zpid'],
85
-     "address": listing['address'],
86
-     "price_raw": listing.get('unformattedPrice') or hi.get('price'),
87
-     "beds": listing.get('beds'),
88
-     "baths": listing.get('baths'),
89
-     "sqft": listing.get('area'),
90
-     "lat": listing['latLong']['latitude'],
91
-     "lon": listing['latLong']['longitude'],
92
-     "status": listing['statusType'],
93
-     "zestimate": listing.get('zestimate'),
94
-     "rent_zest": hi.get('rentZestimate'),
95
-     "home_type": hi.get('homeType'),
96
-     "days_listed": hi.get('daysOnZillow'),
97
-     "tax_assessed": hi.get('taxAssessedValue'),
98
-     "url": listing['detailUrl'],
99
- }
100
- ```
101
-
102
- ### Total result count and pagination
103
-
104
- ```python
105
- d = json.loads(re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL).group(1))
106
- sps = d['props']['pageProps']['searchPageState']
107
-
108
- # Total listings in this search
109
- total = sps['categoryTotals']['cat1']['totalResultCount']
110
- print(total) # 1037
111
-
112
- # Each page returns exactly 41 listings. Add /<N>_p/ for subsequent pages:
113
- # Page 2: https://www.zillow.com/homes/San-Francisco,-CA_rb/2_p/
114
- # Page 3: https://www.zillow.com/homes/San-Francisco,-CA_rb/3_p/
115
-
116
- max_pages = (total + 40) // 41
117
- ```
118
-
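The page arithmetic above can be sketched as a small helper. This is a minimal sketch, assuming the 41-per-page size and the `/<N>_p/` suffix observed in this file still hold; `page_urls` and its `per_page` parameter are hypothetical names and not part of `helpers`.

```python
def page_urls(city_slug, total, per_page=41):
    """Return the search URL for every results page (hypothetical helper)."""
    base = f"https://www.zillow.com/homes/{city_slug}_rb/"
    # Ceiling division: 1037 results at 41 per page -> 26 pages
    max_pages = (total + per_page - 1) // per_page
    # Page 1 has no suffix; later pages append /<N>_p/
    return [base if n == 1 else f"{base}{n}_p/" for n in range(1, max_pages + 1)]

urls = page_urls("San-Francisco,-CA", 1037)
print(len(urls))  # 26
print(urls[1])    # https://www.zillow.com/homes/San-Francisco,-CA_rb/2_p/
```

Building the full URL list up front makes it easy to cap or resume a crawl without recomputing the page count.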
119
- ### Scrape all pages
120
-
121
- ```python
122
- import re, json, time
123
- from helpers import http_get
124
-
125
- HEADERS = {
126
-     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
127
-     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
128
-     "Accept-Language": "en-US,en;q=0.9",
129
- }
130
-
131
- def get_listings(city_slug, page=1):
132
-     """city_slug: e.g. 'San-Francisco,-CA', 'Seattle,-WA', 'Austin,-TX'"""
133
-     if page == 1:
134
-         url = f"https://www.zillow.com/homes/{city_slug}_rb/"
135
-     else:
136
-         url = f"https://www.zillow.com/homes/{city_slug}_rb/{page}_p/"
137
-     html = http_get(url, headers=HEADERS)
138
-     m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
139
-     d = json.loads(m.group(1))
140
-     sps = d['props']['pageProps']['searchPageState']
141
-     total = sps['categoryTotals']['cat1']['totalResultCount']
142
-     listings = sps['cat1']['searchResults']['listResults']
143
-     return listings, total
144
-
145
- all_listings = []
146
- listings, total = get_listings("San-Francisco,-CA")
147
- all_listings.extend(listings)
148
-
149
- max_pages = (total + 40) // 41
150
- for page in range(2, min(max_pages + 1, 6)): # cap at 5 pages for demo
151
-     time.sleep(1.0) # polite delay
152
-     page_listings, _ = get_listings("San-Francisco,-CA", page)
153
-     all_listings.extend(page_listings)
154
-
155
- print(f"Fetched {len(all_listings)} of {total} listings")
156
- ```
157
-
158
- ---
159
-
160
- ## URL patterns tested (all confirmed)
161
-
162
- | URL pattern | Status | Notes |
163
- |---|---|---|
164
- | `/homes/{city}_rb/` | **Works** | For-sale listings |
165
- | `/homes/{city}_rb/{N}_p/` | **Works** | Pagination |
166
- | `/homes/for_sale/{city}/0-1800000_price/` | **Works** | Price filter (max) |
167
- | `/homes/3-_beds/{city}/` | **Works** | Bed count filter |
168
- | `/homes/{zip}_rb/` | **Works** | ZIP code search |
169
- | `/san-francisco-ca/rentals/` | **Works** | Rental listings |
170
- | `/san-francisco-ca/sold/` | **Works** | Recently sold |
171
- | `/homedetails/{address}/{zpid}_zpid/` | **403** | Single property detail |
172
- | `/async-create-search-page-state` | **403** | Internal search API |
173
- | `/graphql/` | **400/403** | GraphQL endpoint |
174
-
175
- ---
176
-
177
- ## Rental listings
178
-
179
- Rental search pages use the same `__NEXT_DATA__` structure. However, rental listing cards have a **different schema** — individual units are nested, not a flat price:
180
-
181
- ```python
182
- html = http_get("https://www.zillow.com/san-francisco-ca/rentals/", headers=HEADERS)
183
- listings = extract_listings(html)
184
-
185
- r = listings[0]
186
- # Multi-unit buildings:
187
- # r['units'] = [{'price': '$3,485+', 'beds': '0', 'roomForRent': False}, ...]
188
- # r['minBaseRent'] = 3485
189
- # r['maxBaseRent'] = 7130
190
- # r['availabilityCount'] = 23
191
-
192
- # Single-unit rentals:
193
- # r['price'] = '$2,500/mo'
194
- # r['unformattedPrice'] = 2500
195
-
196
- # Check which type:
197
- if r.get('isBuilding'):
198
-     price_range = f"${r['minBaseRent']}–${r['maxBaseRent']}/mo"
199
-     units = r.get('units', [])
200
- else:
201
-     price = r.get('unformattedPrice') or r.get('hdpData', {}).get('homeInfo', {}).get('price')
202
- ```
203
-
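The two rental card shapes can be normalized into one price range. This is a minimal sketch, assuming only the fields named above (`isBuilding`, `minBaseRent`, `maxBaseRent`, `unformattedPrice`, `hdpData.homeInfo.price`); `rental_price_range` is a hypothetical helper, and the sample dicts are trimmed stand-ins rather than real responses.

```python
def rental_price_range(card):
    """Return (low, high) monthly rent in dollars, or (None, None) if unknown."""
    if card.get('isBuilding'):
        # Multi-unit building: rent is a range across its units
        return card.get('minBaseRent'), card.get('maxBaseRent')
    # Single unit: one price, with hdpData.homeInfo.price as a fallback
    home_info = (card.get('hdpData') or {}).get('homeInfo') or {}
    price = card.get('unformattedPrice') or home_info.get('price')
    return price, price

building = {'isBuilding': True, 'minBaseRent': 3485, 'maxBaseRent': 7130}
single = {'price': '$2,500/mo', 'unformattedPrice': 2500}
print(rental_price_range(building))  # (3485, 7130)
print(rental_price_range(single))    # (2500, 2500)
```

Returning a pair in both cases lets downstream code filter by rent without branching on the card type.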
204
- ---
205
-
206
- ## Sold listings
207
-
208
- Sold pages (`/sold/`) work identically. Key difference: `statusType` is `"RECENTLY_SOLD"` and price comes from `hdpData.homeInfo.priceForHDP` (not the `price` field which is `None` in sold cards):
209
-
210
- ```python
211
- html = http_get("https://www.zillow.com/san-francisco-ca/sold/", headers=HEADERS)
212
- listings = extract_listings(html)
213
-
214
- for l in listings:
215
-     hi = l.get('hdpData', {}).get('homeInfo', {})
216
-     sold_price = hi.get('priceForHDP') # actual sold price
217
-     zestimate = hi.get('zestimate')
218
-     tax_value = hi.get('taxAssessedValue')
219
-     print(l['address'], f"${sold_price:,}" if sold_price is not None else "n/a", f"zest=${zestimate:,}" if zestimate is not None else "zest=None")
220
- # 999 Green St APT 1702, San Francisco, CA 94133 $3,200,000 zest=$3,403,400
221
- # 1041 Vallejo St, San Francisco, CA 94133 $6,250,000 zest=None
222
- ```
223
-
224
- Total sold inventory in San Francisco: **18,109** (all time in Zillow's database, paginated 41/page).
225
-
226
- ---
227
-
228
- ## Bot detection behavior
229
-
230
- - **Zillow detects bot status server-side** and embeds `window.__USER_SESSION_INITIAL_STATE__` and `props.isBot` in the page.
231
- - In field testing, the page returned `isBot: False` with the Chrome User-Agent — **Zillow does not block the search pages**.
232
- - The page does embed `captcha` strings in the HTML (for the CAPTCHA challenge widget code), but the challenge is NOT triggered for search pages.
233
- - **`/homedetails/` pages do trigger blocking** — every property detail URL tested returned HTTP 403. This is enforced before serving HTML, not via JavaScript CAPTCHA.
234
- - Rate limiting: 3 rapid sequential requests to `/homes/` all succeeded, and no 429s were observed. Add a `time.sleep()` of 0.5 to 1.0 seconds between pages as a courtesy.
235
-
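The courtesy delay can be wrapped in a small generator. This is a sketch under stated assumptions: `fetch` is any callable that takes a URL and returns HTML (for instance a lambda around `http_get` with `HEADERS`), and `polite_fetch` with its `sleep` parameter is a hypothetical helper, not part of `helpers`.

```python
import time

def polite_fetch(urls, fetch, delay=1.0, sleep=time.sleep):
    """Yield (url, html) pairs, pausing `delay` seconds between requests."""
    for i, url in enumerate(urls):
        if i:  # no pause before the first request
            sleep(delay)
        yield url, fetch(url)

# Demo with a stand-in fetcher so the sketch runs offline
pages = list(polite_fetch(["a", "b", "c"], fetch=str.upper, delay=0.0))
print(pages)  # [('a', 'A'), ('b', 'B'), ('c', 'C')]
```

Passing `sleep` as a parameter keeps the delay testable without real waiting.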
236
- ---
237
-
238
- ## What you do NOT get from `http_get`
239
-
240
- Because property detail pages are blocked (403), you lose:
241
-
242
- - Full property description text
243
- - All listing photos (you only get `imgSrc` thumbnail from search)
244
- - Detailed home facts (year built, parking, HVAC, school scores)
245
- - Price history
246
- - Nearby comparable sales (comps)
247
- - Agent contact info
248
-
249
- **To get these**, you must navigate to the `/homedetails/` URL in a browser session. The browser is not blocked, because Zillow's JS challenges and fingerprinting only run in a browser context, where a real browser session can satisfy them.
250
-
251
- ---
252
-
253
- ## Alternative: Redfin (field-tested, more accessible)
254
-
255
- Redfin allows `http_get` with no blocking for both HTML pages and its internal API.
256
-
257
- ### Redfin JSON-LD per listing (easiest)
258
-
259
- Each Redfin search results page embeds one `<script type="application/ld+json">` per listing with structured property data:
260
-
261
- ```python
262
- import re, json
263
- from helpers import http_get
264
-
265
- HEADERS = {
266
-     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
267
-     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
268
-     "Accept-Language": "en-US,en;q=0.9",
269
- }
270
-
271
- html = http_get(
272
-     "https://www.redfin.com/city/17151/CA/San-Francisco/filter/property-type=house",
273
-     headers=HEADERS
274
- )
275
- print(len(html)) # ~1.6 MB
276
-
277
- # Extract all SingleFamilyResidence JSON-LD entries
278
- properties = []
279
- for s in re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL):
280
-     try:
281
-         d = json.loads(s)
282
-         if isinstance(d, list):
283
-             for item in d:
284
-                 if item.get('@type') in ('SingleFamilyResidence', 'House', 'Residence', 'Apartment'):
285
-                     properties.append(item)
286
-     except Exception:
287
-         pass
288
-
289
- prop = properties[0]
290
- print("Name:", prop['name']) # "662 Hampshire St, San Francisco, CA 94110"
291
- print("Address:", prop['address'])
292
- # {'@type': 'PostalAddress', 'streetAddress': '662 Hampshire St',
293
- # 'addressLocality': 'San Francisco', 'addressRegion': 'CA',
294
- # 'postalCode': '94110', 'addressCountry': 'US'}
295
- print("Rooms:", prop['numberOfRooms']) # 3
296
- print("Floor size:", prop['floorSize']) # {'@type': 'QuantitativeValue', 'value': 3350, 'unitCode': 'FTK'}
297
- print("URL:", prop['url'])
298
- # https://www.redfin.com/CA/San-Francisco/662-Hampshire-St-94110/home/1533754
299
- ```
300
-
301
- Note: The JSON-LD schema does NOT include price (Redfin omits `offers` from the LD+JSON). Use the stingray API below for price.
302
-
303
- ### Redfin stingray API (structured JSON with price)
304
-
305
- Redfin's internal GIS/search API returns rich structured data including price, MLS ID, beds, baths, sqft, agent info, and remarks. Responses are prefixed with `{}&&` — strip it before parsing:
306
-
307
- ```python
308
- import json
309
- from helpers import http_get
310
-
311
- def redfin_search(region_id, region_type=6, num_homes=20, page=1, uipt="1,2,3,4,5,6"):
312
-     """
313
-     region_type: 6=city, 2=zipcode, 5=county
314
-     uipt: property types (1=house, 2=condo, 3=townhouse, 4=multi-family, 5=land, 6=other)
315
-     """
316
-     url = (
317
-         f"https://www.redfin.com/stingray/api/gis"
318
-         f"?al=1&num_homes={num_homes}&ord=redfin-recommended-asc"
319
-         f"&page_number={page}&region_id={region_id}&region_type={region_type}"
320
-         f"&sf=1,2,3,5,6,7&status=9&uipt={uipt}&v=8"
321
-     )
322
-     raw = http_get(url, headers={
323
-         "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
324
-         "Referer": "https://www.redfin.com/",
325
-         "Accept": "*/*",
326
-     })
327
-     # Strip the {}&& CSRF prefix Redfin prepends to all API responses
328
-     assert raw.startswith('{}&&'), f"Unexpected prefix: {raw[:10]}"
329
-     return json.loads(raw[4:])
330
-
331
- data = redfin_search(region_id=17151) # 17151 = San Francisco, CA
332
- homes = data['payload']['homes']
333
-
334
- home = homes[0]
335
- print("Address:", home['streetLine']['value']) # "875 California St #703"
336
- print("City/State/Zip:", home['city'], home['state'], home['zip'])
337
- print("Price:", home['price']['value']) # 3300000
338
- print("Beds:", home['beds']) # 3
339
- print("Baths:", home['baths']) # 2.5
340
- print("Sqft:", home['sqFt']['value']) # 1828
341
- print("$/sqft:", home['pricePerSqFt']['value']) # 1805
342
- print("Lot size:", home['lotSize']['value']) # 9448
343
- print("Year built:", home['yearBuilt']['value']) # 2021
344
- print("Days on market:", home['dom']['value']) # 1
345
- print("MLS ID:", home['mlsId']['value']) # "426115342"
346
- print("MLS Status:", home['mlsStatus']) # "Active"
347
- print("Lat/Long:", home['latLong']['value'])
348
- print("URL:", home['url']) # "/CA/San-Francisco/..."
349
- print("Remarks:", home['listingRemarks'][:100])
350
- ```
351
-
352
- ### Redfin region IDs
353
-
354
- | City | region_id | region_type |
355
- |---|---|---|
356
- | San Francisco, CA | `17151` | `6` (city) |
357
- | Los Angeles, CA | `17152` | `6` |
358
- | New York, NY | `17834` | `6` |
359
- | Seattle, WA | `16163` | `6` |
360
-
361
- To find other region IDs: search on Redfin, look at the URL (e.g. `/city/17151/CA/San-Francisco`) — the number is the region_id.
362
-
363
- ### Redfin stingray response structure
364
-
365
- ```
366
- data['payload']['homes'][i]
367
- .streetLine.value → street address string
368
- .city / .state / .zip → strings
369
- .price.value → int (asking price in dollars)
370
- .sqFt.value → int (square feet)
371
- .pricePerSqFt.value → int
372
- .beds → int
373
- .baths → float (2.5 = 2 full + 1 half)
374
- .fullBaths / .partialBaths → ints
375
- .lotSize.value → int (sq ft)
376
- .yearBuilt.value → int
377
- .dom.value → days on market (int)
378
- .mlsId.value → MLS listing number (string)
379
- .mlsStatus → "Active", "Pending", etc.
380
- .listingId → Redfin internal int
381
- .propertyId → Redfin internal int
382
- .latLong.value → {'latitude': float, 'longitude': float}
383
- .url → relative URL "/CA/San-Francisco/..."
384
- .listingRemarks → description text (may be truncated)
385
- .keyFacts → [{'description': str, 'rank': int}]
386
- .listingTags → ['SWEEPING CITY VIEWS', ...]
387
- .hoa.value → HOA monthly (int)
388
- .location.value → neighborhood name string
389
- .sashes → [{'sashTypeName': 'New'/'Price Drop'/...}]
390
- .photos.value → photo token string
391
- .numPictures → int
392
- ```
393
-
394
- ---
395
-
396
- ## Alternative APIs (no scraping required)
397
-
398
- If you need property data without scraping Zillow or Redfin at scale:
399
-
400
- | API | Free tier | Key data |
401
- |---|---|---|
402
- | **ATTOM Data** (attomdata.com) | Trial available | Ownership, AVM, tax, sale history, building characteristics |
403
- | **Rentcast** (rentcastapi.com) | 50 req/mo free | Rental estimates, comps, market data |
404
- | **RapidAPI: Zillow56** | ~100 req/mo free | Wraps Zillow data (unofficial, use at own risk) |
405
- | **HouseCanary** | Paid | AVM, market risk, rental value |
406
- | **Redfin API** (unofficial, above) | No stated quota (unofficial, undocumented endpoint) | MLS listing data |
407
- | **US Census / HUD** | Free, no key | Median home values by geography, affordability |
408
-
409
- ---
410
-
411
- ## Gotchas
412
-
413
- - **Single User-Agent word triggers 403.** `http_get` passes `"Mozilla/5.0"` as default User-Agent — this gets blocked. Always pass the full Chrome UA via the `headers=` argument.
414
-
415
- - **`price` field is `None` for sold and rental multi-unit listings.** Use `unformattedPrice` for for-sale, `hdpData.homeInfo.priceForHDP` for sold, and `minBaseRent`/`maxBaseRent` for rentals.
416
-
417
- - **`/homedetails/` is unconditionally blocked.** Tested with full browser headers, Referer, Sec-Fetch-* headers — all return HTTP 403. Only the browser bypasses this.
418
-
419
- - **41 listings per page, hardcoded.** Zillow always returns exactly 41 results per page from `listResults`. `mapResults` was empty in all tests (server-side response only).
420
-
421
- - **`isBot: False` doesn't mean you're safe.** Zillow correctly identifies server-side requests and blocks `/homedetails/`. The `isBot` flag in `__NEXT_DATA__` is `False` for search pages but the restriction is enforced at route level for detail pages.
422
-
423
- - **Captcha strings in HTML do not mean CAPTCHA is active.** The search page includes the captcha widget JavaScript (for lazy loading if needed) but does not serve a challenge — confirmed by successfully parsing listing data from the same HTML.
424
-
425
- - **Redfin `{}&&` prefix on all API responses.** Strip with `raw[4:]` before `json.loads()`. If the prefix changes, the assertion fails explicitly.
426
-
427
- - **Redfin JSON-LD omits price.** The `SingleFamilyResidence` schema objects do not include an `offers` field — use the stingray API for pricing.
428
-
429
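- - **Redfin stingray API wraps many listing fields in `{'value': X, 'level': N}` dicts.** Read `['value']` for wrapped fields such as `price`, `sqFt`, `lotSize`, `yearBuilt`, `dom`, and `mlsId` (e.g. `home['price']['value']`, not `home['price']`); `beds`, `baths`, `city`, `state`, `zip`, and `mlsStatus` are plain values. Level `1` means data is public; `2` means potentially restricted.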
430
-
431
- - **Zillow total count can exceed 800 but pagination caps at page ~20.** Zillow caps search results at around 800 listings even if `totalResultCount` shows 1037. Narrow by ZIP code, neighborhood, or price range to stay within bounds.
432
-
433
- - **URL filter syntax for Zillow:** Beds: `3-_beds` prefix; price: `0-1800000_price` suffix; ZIP: use `{zip}_rb` instead of city slug. Test by building the URL in a browser and copying the pattern.
1
+ # Zillow — Scraping & Data Extraction
2
+
3
+ Field-tested against `www.zillow.com` on 2026-04-18 using `http_get` (no browser).
4
+
5
+ ## Quick summary
6
+
7
+ - **Search listing pages (`/homes/`, `/sold/`, `/rentals/`)** — `http_get` works with full Chrome headers. Returns ~973 KB HTML with all listing data embedded in `__NEXT_DATA__` JSON.
8
+ - **Individual property detail pages (`/homedetails/`)** — `http_get` returns **HTTP 403** unconditionally. No header combination bypasses this.
9
+ - **Internal API endpoints** (`/async-create-search-page-state`, `/graphql/`) — **403** for all server-side requests regardless of headers.
10
+ - **Redfin** — `http_get` works; HTML contains both JSON-LD per listing and a stingray JSON API.
11
+
12
+ ---
13
+
14
+ ## What works: search listing pages via `__NEXT_DATA__`
15
+
16
+ Zillow search pages embed all listing data in `<script id="__NEXT_DATA__">`. This is standard Next.js SSR output — it is the same data Zillow's React app hydrates from.
17
+
18
+ **Required headers** — The single-word User-Agent (`"Mozilla/5.0"`) used by `http_get` internally gets 403. You must pass a full Chrome UA plus Accept/Accept-Language headers:
19
+
20
+ ```python
21
+ import re, json
22
+ from helpers import http_get
23
+
24
+ HEADERS = {
25
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
26
+ "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
27
+ "Accept-Language": "en-US,en;q=0.9",
28
+ }
29
+
30
+ def extract_listings(html):
31
+     """Parse Zillow __NEXT_DATA__ and return list of listing dicts."""
32
+     m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
33
+     if not m:
34
+         return []
35
+     d = json.loads(m.group(1))
36
+     sps = d['props']['pageProps']['searchPageState']
37
+     return sps['cat1']['searchResults']['listResults']
38
+
39
+ html = http_get("https://www.zillow.com/homes/San-Francisco,-CA_rb/", headers=HEADERS)
40
+ listings = extract_listings(html)
41
+ print(len(listings)) # 41 — always 41 per page
42
+ ```
43
+
44
+ ### Fields available in each listing card
45
+
46
+ The `listResults` array is the canonical source. Each entry includes:
47
+
48
+ | Field | Source | Example |
49
+ |---|---|---|
50
+ | `zpid` | listing | `15081707` |
51
+ | `address` | listing | `"212 Spruce St, San Francisco, CA 94118"` |
52
+ | `addressStreet`, `addressCity`, `addressState`, `addressZipcode` | listing | split address components |
53
+ | `price` | listing | `"$4,395,000"` (formatted string) |
54
+ | `unformattedPrice` | listing | `4395000` (int, use for math) |
55
+ | `beds` | listing | `4` |
56
+ | `baths` | listing | `4` |
57
+ | `area` | listing | `4133` (sqft) |
58
+ | `latLong` | listing | `{'latitude': 37.78867, 'longitude': -122.45361}` |
59
+ | `statusType` | listing | `"FOR_SALE"` / `"FOR_RENT"` / `"RECENTLY_SOLD"` |
60
+ | `detailUrl` | listing | full `https://www.zillow.com/homedetails/...` URL |
61
+ | `zestimate` | listing | `4857200` (Zillow AI estimate, int) |
62
+ | `imgSrc` | listing | thumbnail URL |
63
+ | `has3DModel` | listing | `True`/`False` |
64
+ | `hasOpenHouse` | listing | `True`/`False` |
65
+ | `openHouseStartDate`, `openHouseEndDate` | listing | ISO strings |
66
+ | `isFeaturedListing` | listing | sponsored/featured flag |
67
+ | `brokerName` | listing | `"Sotheby's International Realty"` |
68
+ | `statusText` | listing | `"FOR SALE"` display string |
69
+ | `hdpData.homeInfo.price` | nested | raw price int (matches `unformattedPrice`) |
70
+ | `hdpData.homeInfo.zestimate` | nested | raw zestimate int |
71
+ | `hdpData.homeInfo.rentZestimate` | nested | monthly rent estimate |
72
+ | `hdpData.homeInfo.homeType` | nested | `"SINGLE_FAMILY"`, `"CONDO"`, `"TOWNHOUSE"` etc. |
73
+ | `hdpData.homeInfo.daysOnZillow` | nested | int |
74
+ | `hdpData.homeInfo.taxAssessedValue` | nested | int |
75
+ | `hdpData.homeInfo.lotAreaValue` + `lotAreaUnit` | nested | e.g. `2957.724`, `"sqft"` |
76
+ | `hdpData.homeInfo.priceForHDP` | nested | reliable sold price for recently-sold listings |
77
+
78
+ ```python
79
+ # Full extraction snippet
80
+ listing = listings[0]
81
+ hi = listing.get('hdpData', {}).get('homeInfo', {})
82
+
83
+ record = {
84
+ "zpid": listing['zpid'],
85
+ "address": listing['address'],
86
+ "price_raw": listing.get('unformattedPrice') or hi.get('price'),
87
+ "beds": listing.get('beds'),
88
+ "baths": listing.get('baths'),
89
+ "sqft": listing.get('area'),
90
+ "lat": listing['latLong']['latitude'],
91
+ "lon": listing['latLong']['longitude'],
92
+ "status": listing['statusType'],
93
+ "zestimate": listing.get('zestimate'),
94
+ "rent_zest": hi.get('rentZestimate'),
95
+ "home_type": hi.get('homeType'),
96
+ "days_listed": hi.get('daysOnZillow'),
97
+ "tax_assessed": hi.get('taxAssessedValue'),
98
+ "url": listing['detailUrl'],
99
+ }
100
+ ```
101
+
102
+ ### Total result count and pagination
103
+
104
+ ```python
105
+ d = json.loads(re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL).group(1))
106
+ sps = d['props']['pageProps']['searchPageState']
107
+
108
+ # Total listings in this search
109
+ total = sps['categoryTotals']['cat1']['totalResultCount']
110
+ print(total) # 1037
111
+
112
+ # Each page returns exactly 41 listings. Add /<N>_p/ for subsequent pages:
113
+ # Page 2: https://www.zillow.com/homes/San-Francisco,-CA_rb/2_p/
114
+ # Page 3: https://www.zillow.com/homes/San-Francisco,-CA_rb/3_p/
115
+
116
+ max_pages = (total + 40) // 41
117
+ ```
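The page arithmetic above can be wrapped in a small helper. This is a minimal sketch: `page_urls`, `PAGE_SIZE`, and `MAX_PAGES` are hypothetical names, and the 20-page cap is an assumption based on the roughly 800-listing ceiling described under Gotchas, not a documented Zillow limit.

```python
import math

PAGE_SIZE = 41   # listings per search page, as observed in testing
MAX_PAGES = 20   # assumption: results stop being served around page 20

def page_urls(city_slug, total, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return the search-page URLs needed to cover `total` listings."""
    base = f"https://www.zillow.com/homes/{city_slug}_rb/"
    n_pages = min(math.ceil(total / page_size), max_pages)
    # Page 1 has no suffix; later pages append /<N>_p/
    return [base if n == 1 else f"{base}{n}_p/" for n in range(1, n_pages + 1)]

urls = page_urls("San-Francisco,-CA", 1037)
print(len(urls))   # 20 (26 pages would be needed, but the cap applies)
print(urls[1])     # https://www.zillow.com/homes/San-Francisco,-CA_rb/2_p/
```

Capping up front avoids requesting pages that are unlikely to return new listings; narrowing the search is the way to reach the remainder.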
118
+
119
+ ### Scrape all pages
120
+
121
+ ```python
122
+ import re, json, time
123
+ from helpers import http_get
124
+
125
+ HEADERS = {
126
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
127
+ "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
128
+ "Accept-Language": "en-US,en;q=0.9",
129
+ }
130
+
131
+ def get_listings(city_slug, page=1):
132
+     """city_slug: e.g. 'San-Francisco,-CA', 'Seattle,-WA', 'Austin,-TX'"""
133
+     if page == 1:
134
+         url = f"https://www.zillow.com/homes/{city_slug}_rb/"
135
+     else:
136
+         url = f"https://www.zillow.com/homes/{city_slug}_rb/{page}_p/"
137
+     html = http_get(url, headers=HEADERS)
138
+     m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
139
+     d = json.loads(m.group(1))
140
+     sps = d['props']['pageProps']['searchPageState']
141
+     total = sps['categoryTotals']['cat1']['totalResultCount']
142
+     listings = sps['cat1']['searchResults']['listResults']
143
+     return listings, total
144
+
145
+ all_listings = []
146
+ listings, total = get_listings("San-Francisco,-CA")
147
+ all_listings.extend(listings)
148
+
149
+ max_pages = (total + 40) // 41
150
+ for page in range(2, min(max_pages + 1, 6)): # cap at 5 pages for demo
151
+     time.sleep(1.0) # polite delay
152
+     page_listings, _ = get_listings("San-Francisco,-CA", page)
153
+     all_listings.extend(page_listings)
154
+
155
+ print(f"Fetched {len(all_listings)} of {total} listings")
156
+ ```
157
+
158
+ ---
159
+
160
+ ## URL patterns tested (all confirmed)
161
+
162
+ | URL pattern | Status | Notes |
163
+ |---|---|---|
164
+ | `/homes/{city}_rb/` | **Works** | For-sale listings |
165
+ | `/homes/{city}_rb/{N}_p/` | **Works** | Pagination |
166
+ | `/homes/for_sale/{city}/0-1800000_price/` | **Works** | Price filter (max) |
167
+ | `/homes/3-_beds/{city}/` | **Works** | Bed count filter |
168
+ | `/homes/{zip}_rb/` | **Works** | ZIP code search |
169
+ | `/san-francisco-ca/rentals/` | **Works** | Rental listings |
170
+ | `/san-francisco-ca/sold/` | **Works** | Recently sold |
171
+ | `/homedetails/{address}/{zpid}_zpid/` | **403** | Single property detail |
172
+ | `/async-create-search-page-state` | **403** | Internal search API |
173
+ | `/graphql/` | **400/403** | GraphQL endpoint |
174
+
175
+ ---
176
+
177
+ ## Rental listings
178
+
179
+ Rental search pages use the same `__NEXT_DATA__` structure, but rental listing cards have a **different schema**. Multi-unit buildings nest per-unit prices under `units` and expose `minBaseRent`/`maxBaseRent` rather than a single flat price:
180
+
181
+ ```python
182
+ html = http_get("https://www.zillow.com/san-francisco-ca/rentals/", headers=HEADERS)
183
+ listings = extract_listings(html)
184
+
185
+ r = listings[0]
186
+ # Multi-unit buildings:
187
+ # r['units'] = [{'price': '$3,485+', 'beds': '0', 'roomForRent': False}, ...]
188
+ # r['minBaseRent'] = 3485
189
+ # r['maxBaseRent'] = 7130
190
+ # r['availabilityCount'] = 23
191
+
192
+ # Single-unit rentals:
193
+ # r['price'] = '$2,500/mo'
194
+ # r['unformattedPrice'] = 2500
195
+
196
+ # Check which type:
197
+ if r.get('isBuilding'):
198
+     price_range = f"${r['minBaseRent']}–${r['maxBaseRent']}/mo"
199
+     units = r.get('units', [])
200
+ else:
201
+     price = r.get('unformattedPrice') or r.get('hdpData', {}).get('homeInfo', {}).get('price')
202
+ ```
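To treat both rental shapes uniformly, the branch above can be folded into one function. This is a sketch only; `rental_price_range` is a hypothetical helper name, and the sample dicts are trimmed-down stand-ins for real listing cards.

```python
def rental_price_range(listing):
    """Return (min_rent, max_rent) in dollars per month, or (None, None) if unknown."""
    if listing.get('isBuilding'):
        # Multi-unit building: Zillow exposes a range instead of one price
        return listing.get('minBaseRent'), listing.get('maxBaseRent')
    home_info = listing.get('hdpData', {}).get('homeInfo', {})
    price = listing.get('unformattedPrice') or home_info.get('price')
    return price, price

building = {'isBuilding': True, 'minBaseRent': 3485, 'maxBaseRent': 7130}
single = {'price': '$2,500/mo', 'unformattedPrice': 2500}
print(rental_price_range(building))  # (3485, 7130)
print(rental_price_range(single))    # (2500, 2500)
```

Returning a pair in both cases lets downstream code filter by rent without checking `isBuilding` again.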
203
+
204
+ ---
205
+
206
+ ## Sold listings
207
+
208
+ Sold pages (`/sold/`) work identically. Key difference: `statusType` is `"RECENTLY_SOLD"` and price comes from `hdpData.homeInfo.priceForHDP` (not the `price` field which is `None` in sold cards):
209
+
210
+ ```python
211
+ html = http_get("https://www.zillow.com/san-francisco-ca/sold/", headers=HEADERS)
212
+ listings = extract_listings(html)
213
+
214
+ for l in listings:
215
+     hi = l.get('hdpData', {}).get('homeInfo', {})
216
+     sold_price = hi.get('priceForHDP') # actual sold price
217
+     zestimate = hi.get('zestimate')
218
+     tax_value = hi.get('taxAssessedValue')
219
+     print(l['address'], f"${sold_price:,}" if sold_price is not None else "n/a", f"zest=${zestimate:,}" if zestimate is not None else "zest=None")
220
+     # 999 Green St APT 1702, San Francisco, CA 94133 $3,200,000 zest=$3,403,400
221
+     # 1041 Vallejo St, San Francisco, CA 94133 $6,250,000 zest=None
222
+ ```
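Because the reliable price field depends on `statusType`, a single accessor can hide the difference between for-sale and sold cards. A minimal sketch, assuming only the fields described in this document; `sale_price` is a hypothetical name.

```python
def sale_price(listing):
    """Pick the price field that matches the card's statusType."""
    home_info = listing.get('hdpData', {}).get('homeInfo', {})
    if listing.get('statusType') == 'RECENTLY_SOLD':
        # Sold cards: the top-level price is None, so use priceForHDP
        return home_info.get('priceForHDP')
    return listing.get('unformattedPrice') or home_info.get('price')

sold = {'statusType': 'RECENTLY_SOLD', 'price': None,
        'hdpData': {'homeInfo': {'priceForHDP': 3200000}}}
active = {'statusType': 'FOR_SALE', 'unformattedPrice': 4395000}
print(sale_price(sold))    # 3200000
print(sale_price(active))  # 4395000
```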
223
+
224
+ Total sold inventory in San Francisco: **18,109** (all time in Zillow's database, paginated 41/page).
225
+
226
+ ---
227
+
228
+ ## Bot detection behavior
229
+
230
+ - **Zillow detects bot status server-side** and embeds `window.__USER_SESSION_INITIAL_STATE__` and `props.isBot` in the page.
231
+ - In field testing, the page returned `isBot: False` with the Chrome User-Agent — **Zillow does not block the search pages**.
232
+ - The page does embed `captcha` strings in the HTML (for the CAPTCHA challenge widget code), but the challenge is NOT triggered for search pages.
233
+ - **`/homedetails/` pages do trigger blocking** — every property detail URL tested returned HTTP 403. This is enforced before serving HTML, not via JavaScript CAPTCHA.
234
+ - Rate limiting: 3 rapid sequential requests to `/homes/` all succeeded, and no 429s were observed. Add a `time.sleep()` of 0.5 to 1.0 seconds between pages as a courtesy.
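The courtesy delay can be kept in one place rather than repeated in every loop. The sketch below is an illustration only: `fetch_politely` is a hypothetical helper, and `fetch` stands in for whatever callable performs the request (for example a wrapper around `http_get` with the full headers).

```python
import time

def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no pause before the first request
            sleep(delay)
        pages.append(fetch(url))
    return pages

# Usage with a stand-in fetch function and no real waiting:
pages = fetch_politely(["a", "b", "c"], fetch=str.upper, delay=0.0)
print(pages)  # ['A', 'B', 'C']
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.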
235
+
236
+ ---
237
+
238
+ ## What you do NOT get from `http_get`
239
+
240
+ Because property detail pages are blocked (403), you lose:
241
+
242
+ - Full property description text
243
+ - All listing photos (you only get `imgSrc` thumbnail from search)
244
+ - Detailed home facts (year built, parking, HVAC, school scores)
245
+ - Price history
246
+ - Nearby comparable sales (comps)
247
+ - Agent contact info
248
+
249
+ **To get these**, you must navigate to the `/homedetails/` URL in a browser session. The browser is not blocked (Zillow relies on JS challenges and fingerprinting that only trigger in browser context).
250
+
251
+ ---
252
+
253
+ ## Alternative: Redfin (field-tested, more accessible)
254
+
255
+ Redfin allows `http_get` with no blocking for both HTML pages and its internal API.
256
+
257
+ ### Redfin JSON-LD per listing (easiest)
258
+
259
+ Each Redfin search results page embeds one `<script type="application/ld+json">` per listing with structured property data:
260
+
261
+ ```python
262
+ import re, json
263
+ from helpers import http_get
264
+
265
+ HEADERS = {
266
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
267
+ "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
268
+ "Accept-Language": "en-US,en;q=0.9",
269
+ }
270
+
271
+ html = http_get(
272
+ "https://www.redfin.com/city/17151/CA/San-Francisco/filter/property-type=house",
273
+ headers=HEADERS
274
+ )
275
+ print(len(html)) # ~1.6 MB
276
+
277
+ # Extract all SingleFamilyResidence JSON-LD entries
278
+ properties = []
279
+ for s in re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL):
280
+     try:
281
+         d = json.loads(s)
282
+         if isinstance(d, list):
283
+             for item in d:
284
+                 if item.get('@type') in ('SingleFamilyResidence', 'House', 'Residence', 'Apartment'):
285
+                     properties.append(item)
286
+     except Exception:
287
+         pass
288
+
289
+ prop = properties[0]
290
+ print("Name:", prop['name']) # "662 Hampshire St, San Francisco, CA 94110"
291
+ print("Address:", prop['address'])
292
+ # {'@type': 'PostalAddress', 'streetAddress': '662 Hampshire St',
293
+ # 'addressLocality': 'San Francisco', 'addressRegion': 'CA',
294
+ # 'postalCode': '94110', 'addressCountry': 'US'}
295
+ print("Rooms:", prop['numberOfRooms']) # 3
296
+ print("Floor size:", prop['floorSize']) # {'@type': 'QuantitativeValue', 'value': 3350, 'unitCode': 'FTK'}
297
+ print("URL:", prop['url'])
298
+ # https://www.redfin.com/CA/San-Francisco/662-Hampshire-St-94110/home/1533754
299
+ ```
300
+
301
+ Note: The JSON-LD schema does NOT include price (Redfin omits `offers` from the LD+JSON). Use the stingray API below for price.
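The loop above only inspects scripts whose JSON body is a list. As a hedged variant, the sketch below also accepts a script that holds a single object and tolerates a list-valued `@type`; whether Redfin ever emits those shapes is an assumption, and `extract_residences` is a hypothetical name.

```python
import json
import re

LD_JSON_RE = re.compile(r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL)
RESIDENCE_TYPES = {'SingleFamilyResidence', 'House', 'Residence', 'Apartment'}

def extract_residences(html):
    """Collect residence JSON-LD objects whether a script holds a dict or a list."""
    found = []
    for raw in LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of hiding every error
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get('@type')
            types = types if isinstance(types, list) else [types]
            if any(isinstance(t, str) and t in RESIDENCE_TYPES for t in types):
                found.append(item)
    return found

sample = (
    '<script type="application/ld+json">'
    '[{"@type": "SingleFamilyResidence", "name": "A"}, {"@type": "Event"}]</script>'
    '<script type="application/ld+json">{"@type": "House", "name": "B"}</script>'
    '<script type="application/ld+json">not json</script>'
)
print([p['name'] for p in extract_residences(sample)])  # ['A', 'B']
```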
302
+
303
+ ### Redfin stingray API (structured JSON with price)
304
+
305
+ Redfin's internal GIS/search API returns rich structured data including price, MLS ID, beds, baths, sqft, agent info, and remarks. Responses are prefixed with `{}&&` — strip it before parsing:
306
+
307
+ ```python
308
+ import json
309
+ from helpers import http_get
310
+
311
+ def redfin_search(region_id, region_type=6, num_homes=20, page=1, uipt="1,2,3,4,5,6"):
312
+     """
313
+     region_type: 6=city, 2=zipcode, 5=county
314
+     uipt: property types (1=house, 2=condo, 3=townhouse, 4=multi-family, 5=land, 6=other)
315
+     """
316
+     url = (
317
+         f"https://www.redfin.com/stingray/api/gis"
318
+         f"?al=1&num_homes={num_homes}&ord=redfin-recommended-asc"
319
+         f"&page_number={page}&region_id={region_id}&region_type={region_type}"
320
+         f"&sf=1,2,3,5,6,7&status=9&uipt={uipt}&v=8"
321
+     )
322
+     raw = http_get(url, headers={
323
+         "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
324
+         "Referer": "https://www.redfin.com/",
325
+         "Accept": "*/*",
326
+     })
327
+     # Strip the {}&& anti-JSON-hijacking prefix Redfin prepends to all API responses
328
+     assert raw.startswith('{}&&'), f"Unexpected prefix: {raw[:10]}"
329
+     return json.loads(raw[4:])
330
+
331
+ data = redfin_search(region_id=17151) # 17151 = San Francisco, CA
332
+ homes = data['payload']['homes']
333
+
334
+ home = homes[0]
335
+ print("Address:", home['streetLine']['value']) # "875 California St #703"
336
+ print("City/State/Zip:", home['city'], home['state'], home['zip'])
337
+ print("Price:", home['price']['value']) # 3300000
338
+ print("Beds:", home['beds']) # 3
339
+ print("Baths:", home['baths']) # 2.5
340
+ print("Sqft:", home['sqFt']['value']) # 1828
341
+ print("$/sqft:", home['pricePerSqFt']['value']) # 1805
342
+ print("Lot size:", home['lotSize']['value']) # 9448
343
+ print("Year built:", home['yearBuilt']['value']) # 2021
344
+ print("Days on market:", home['dom']['value']) # 1
345
+ print("MLS ID:", home['mlsId']['value']) # "426115342"
346
+ print("MLS Status:", home['mlsStatus']) # "Active"
347
+ print("Lat/Long:", home['latLong']['value'])
348
+ print("URL:", home['url']) # "/CA/San-Francisco/..."
349
+ print("Remarks:", home['listingRemarks'][:100])
350
+ ```
351
+
352
+ ### Redfin region IDs
353
+
354
+ | City | region_id | region_type |
355
+ |---|---|---|
356
+ | San Francisco, CA | `17151` | `6` (city) |
357
+ | Los Angeles, CA | `17152` | `6` |
358
+ | New York, NY | `17834` | `6` |
359
+ | Seattle, WA | `16163` | `6` |
360
+
361
+ To find other region IDs: search on Redfin, look at the URL (e.g. `/city/17151/CA/San-Francisco`) — the number is the region_id.
362
+
363
+ ### Redfin stingray response structure
364
+
365
+ ```
366
+ data['payload']['homes'][i]
367
+ .streetLine.value → street address string
368
+ .city / .state / .zip → strings
369
+ .price.value → int (asking price in dollars)
370
+ .sqFt.value → int (square feet)
371
+ .pricePerSqFt.value → int
372
+ .beds → int
373
+ .baths → float (2.5 = 2 full + 1 half)
374
+ .fullBaths / .partialBaths → ints
375
+ .lotSize.value → int (sq ft)
376
+ .yearBuilt.value → int
377
+ .dom.value → days on market (int)
378
+ .mlsId.value → MLS listing number (string)
379
+ .mlsStatus → "Active", "Pending", etc.
380
+ .listingId → Redfin internal int
381
+ .propertyId → Redfin internal int
382
+ .latLong.value → {'latitude': float, 'longitude': float}
383
+ .url → relative URL "/CA/San-Francisco/..."
384
+ .listingRemarks → description text (may be truncated)
385
+ .keyFacts → [{'description': str, 'rank': int}]
386
+ .listingTags → ['SWEEPING CITY VIEWS', ...]
387
+ .hoa.value → HOA monthly (int)
388
+ .location.value → neighborhood name string
389
+ .sashes → [{'sashTypeName': 'New'/'Price Drop'/...}]
390
+ .photos.value → photo token string
391
+ .numPictures → int
392
+ ```
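Since some fields arrive wrapped in `{'value': ..., 'level': ...}` and others are plain, a small unwrapping step makes the records easier to use. This is a sketch under stated assumptions: `parse_stingray`, `unwrap`, and `flatten_home` are hypothetical helpers, and treating every dict-valued field as a wrapper is an assumption based on the structure listed above.

```python
import json

REDFIN_PREFIX = '{}&&'

def parse_stingray(raw):
    """Strip the {}&& prefix and decode the JSON body."""
    if not raw.startswith(REDFIN_PREFIX):
        raise ValueError(f"Unexpected prefix: {raw[:10]!r}")
    return json.loads(raw[len(REDFIN_PREFIX):])

def unwrap(field, default=None):
    """Return field['value'] for wrapped fields, the field itself otherwise."""
    if isinstance(field, dict):
        return field.get('value', default)
    return default if field is None else field

def flatten_home(home):
    """Flatten one home dict into plain values; missing keys become None."""
    keys = ('streetLine', 'price', 'sqFt', 'beds', 'baths',
            'yearBuilt', 'mlsId', 'mlsStatus')
    return {k: unwrap(home.get(k)) for k in keys}

raw = ('{}&&{"payload": {"homes": [{"price": {"value": 3300000, "level": 1}, '
       '"beds": 3, "mlsStatus": "Active"}]}}')
home = parse_stingray(raw)['payload']['homes'][0]
flat = flatten_home(home)
print(flat['price'], flat['beds'], flat['yearBuilt'])  # 3300000 3 None
```

Raising `ValueError` instead of using `assert` keeps the prefix check active even when Python runs with optimizations enabled.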
393
+
394
+ ---
395
+
396
+ ## Alternative APIs (no scraping required)
397
+
398
+ If you need property data without scraping Zillow or Redfin at scale:
399
+
400
+ | API | Free tier | Key data |
401
+ |---|---|---|
402
+ | **ATTOM Data** (attomdata.com) | Trial available | Ownership, AVM, tax, sale history, building characteristics |
403
+ | **Rentcast** (rentcastapi.com) | 50 req/mo free | Rental estimates, comps, market data |
404
+ | **RapidAPI: Zillow56** | ~100 req/mo free | Wraps Zillow data (unofficial, use at own risk) |
405
+ | **HouseCanary** | Paid | AVM, market risk, rental value |
406
+ | **Redfin API** (unofficial, above) | No stated quota (unofficial, undocumented endpoint) | MLS listing data |
407
+ | **US Census / HUD** | Free, no key | Median home values by geography, affordability |
408
+
409
+ ---
410
+
411
+ ## Gotchas
412
+
413
+ - **Single User-Agent word triggers 403.** `http_get` passes `"Mozilla/5.0"` as default User-Agent — this gets blocked. Always pass the full Chrome UA via the `headers=` argument.
414
+
415
+ - **`price` field is `None` for sold and rental multi-unit listings.** Use `unformattedPrice` for for-sale, `hdpData.homeInfo.priceForHDP` for sold, and `minBaseRent`/`maxBaseRent` for rentals.
416
+
417
+ - **`/homedetails/` is unconditionally blocked.** Tested with full browser headers, Referer, Sec-Fetch-* headers — all return HTTP 403. Only the browser bypasses this.
418
+
419
+ - **41 listings per page, hardcoded.** Zillow always returns exactly 41 results per page from `listResults`. `mapResults` was empty in all tests (server-side response only).
420
+
421
+ - **`isBot: False` doesn't mean you're safe.** Zillow correctly identifies server-side requests and blocks `/homedetails/`. The `isBot` flag in `__NEXT_DATA__` is `False` for search pages but the restriction is enforced at route level for detail pages.
422
+
423
+ - **Captcha strings in HTML do not mean CAPTCHA is active.** The search page includes the captcha widget JavaScript (for lazy loading if needed) but does not serve a challenge — confirmed by successfully parsing listing data from the same HTML.
424
+
425
+ - **Redfin `{}&&` prefix on all API responses.** Strip with `raw[4:]` before `json.loads()`. If the prefix changes, the assertion fails explicitly.
426
+
427
+ - **Redfin JSON-LD omits price.** The `SingleFamilyResidence` schema objects do not include an `offers` field — use the stingray API for pricing.
428
+
429
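+ - **Redfin stingray API wraps many listing fields in `{'value': X, 'level': N}` dicts.** Read `['value']` for wrapped fields such as `price`, `sqFt`, `lotSize`, `yearBuilt`, `dom`, and `mlsId` (e.g. `home['price']['value']`, not `home['price']`); `beds`, `baths`, `city`, `state`, `zip`, and `mlsStatus` are plain values. Level `1` means data is public; `2` means potentially restricted.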
430
+
431
+ - **Zillow total count can exceed 800 but pagination caps at page ~20.** Zillow caps search results at around 800 listings even if `totalResultCount` shows 1037. Narrow by ZIP code, neighborhood, or price range to stay within bounds.
432
+
433
+ - **URL filter syntax for Zillow:** Beds: `3-_beds` prefix; price: `0-1800000_price` suffix; ZIP: use `{zip}_rb` instead of city slug. Test by building the URL in a browser and copying the pattern.