@pencil-agent/nano-pencil 2.0.1 → 2.0.3

Files changed (188)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/debug/index.js +9 -9
  118. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  119. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  121. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  122. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  123. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  124. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  125. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  126. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  127. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  128. package/dist/extensions/builtin/goal/README.md +67 -67
  129. package/dist/extensions/builtin/goal/index.js +6 -6
  130. package/dist/extensions/builtin/grub/README.md +112 -112
  131. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  132. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  133. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  134. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  135. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  136. package/dist/extensions/builtin/loop/README.md +92 -92
  137. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  138. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  139. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  140. package/dist/extensions/builtin/sal/README.md +72 -72
  141. package/dist/extensions/builtin/security-audit/README.md +289 -289
  142. package/dist/extensions/builtin/team/AGENT.md +112 -112
  143. package/dist/extensions/builtin/team/TESTING.md +299 -299
  144. package/dist/extensions/builtin/token-save/README.md +56 -56
  145. package/dist/extensions/optional/AGENT.md +10 -10
  146. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  147. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  148. package/dist/modes/interactive/interactive-mode.js +19 -19
  149. package/dist/modes/interactive/theme/dark.json +85 -85
  150. package/dist/modes/interactive/theme/light.json +84 -84
  151. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  152. package/dist/modes/interactive/theme/warm.json +81 -81
  153. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  154. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  155. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  156. package/docs/SDK-TESTING.md +364 -0
  157. package/docs/codex-goal-command-impl.md +1055 -1055
  158. package/docs/codex-goal-vs-grub.md +500 -500
  159. package/docs/custom-provider.md +27 -27
  160. package/docs/extensions.md +27 -27
  161. package/docs/keybindings.md +27 -27
  162. package/docs/loop 重构完成总结.md +250 -250
  163. package/docs/loop 重构完成报告.md +122 -122
  164. package/docs/loop 重构方案.md +1222 -1222
  165. package/docs/loop 重构方案实现报告.md +158 -158
  166. package/docs/loop 重构方案对比分析.md +128 -128
  167. package/docs/loop 重构计划.md +320 -320
  168. package/docs/loop-usage-examples.md +214 -214
  169. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  170. package/docs/models.md +27 -27
  171. package/docs/packages.md +27 -27
  172. package/docs/pi-design-philosophy.md +457 -457
  173. package/docs/planmode.md +1987 -1987
  174. package/docs/prompt-templates.md +27 -27
  175. package/docs/providers.md +27 -27
  176. package/docs/sdk.md +27 -27
  177. package/docs/skills.md +27 -27
  178. package/docs/startup-performance-optimization.md +301 -0
  179. package/docs/themes.md +27 -27
  180. package/docs/tui.md +27 -27
  181. package/docs/认知地图.md +47 -0
  182. package/package.json +190 -190
  183. package/docs/cc-agent-design.md +0 -1297
  184. package/docs/cc-tui-design.md +0 -1333
  185. package/docs/nanoPencil-学习计划.md +0 -170
  186. package/docs/scan-report.md +0 -3820
  187. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  188. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,444 +1,444 @@
1
- # Walmart — Product Search & Data Extraction
2
-
3
- Field-tested against walmart.com on 2026-04-18 using `http_get` (no browser required).
4
- All code blocks were run and outputs verified against live responses.
5
-
6
- ---
7
-
8
- ## Fastest Approach: `http_get` with `__NEXT_DATA__`
9
-
10
- Walmart's Next.js SSR embeds the full search or product payload as JSON in a
11
- `<script id="__NEXT_DATA__">` tag. **No browser needed for search or product detail pages.**
12
- ~2–3 s per page fetch; no CAPTCHA or session cookies required.
13
-
14
- ### Critical UA rule
15
-
16
- | User-Agent | Result |
17
- |---|---|
18
- | `Mozilla/5.0` (bare) | Full HTML + `__NEXT_DATA__` — **use this** |
19
- | `Mozilla/5.0 ... Chrome/120 ...` (full) | PerimeterX "Robot or human?" challenge (200, 15 KB) |
20
- | `Safari/17` full UA | Works (full HTML, ~1.15 MB) |
21
- | `curl/7.x` | PerimeterX challenge |
22
- | `python-requests/2.31` | PerimeterX challenge |
23
-
24
- The bare `Mozilla/5.0` string bypasses PerimeterX. In testing, scripted-client UAs (`curl`,
25
- `python-requests`) and the full Chrome UA triggered the challenge page; the full Safari UA did not.
26
-
27
- ### Base fetch helper
28
-
29
- ```python
30
- import json, re, gzip, urllib.request
31
-
32
- def fetch_walmart(url):
33
-     """
34
-     Fetch any walmart.com page.
35
-     Returns decoded HTML string.
36
-     Raises RuntimeError if PerimeterX bot challenge is returned.
37
-     """
38
-     req = urllib.request.Request(
39
-         url,
40
-         headers={"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"},
41
-     )
42
-     with urllib.request.urlopen(req, timeout=20) as r:
43
-         data = r.read()
44
-         if r.headers.get("Content-Encoding") == "gzip":
45
-             data = gzip.decompress(data)
46
-     html = data.decode()
47
-     if "Robot or human" in html:
48
-         raise RuntimeError(f"PerimeterX challenge triggered: {url}")
49
-     return html
50
-
51
- def parse_next_data(html):
52
-     m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
53
-     if not m:
54
-         raise ValueError("__NEXT_DATA__ not found — page structure may have changed")
55
-     return json.loads(m.group(1))
56
- ```
57
-
58
- ---
59
-
60
- ## Search Results
61
-
62
- ### URL patterns
63
-
64
- ```python
65
- # Keyword search
66
- "https://www.walmart.com/search?q=laptop"
67
-
68
- # Pagination — append &page=N
69
- "https://www.walmart.com/search?q=laptop&page=2"
70
-
71
- # Sort options (confirmed working)
72
- "https://www.walmart.com/search?q=laptop&sort=best_match" # default
73
- "https://www.walmart.com/search?q=laptop&sort=best_seller"
74
- "https://www.walmart.com/search?q=laptop&sort=price_low"
75
- "https://www.walmart.com/search?q=laptop&sort=customer_rating"
76
-
77
- # Price filter
78
- "https://www.walmart.com/search?q=laptop&min_price=200&max_price=500"
79
-
80
- # Browse by category (department ID path)
81
- "https://www.walmart.com/browse/electronics/laptops/3944_1089430_3951"
82
- ```
83
-
84
- ### `__NEXT_DATA__` path to items
85
-
86
- ```
87
- data
88
- .props.pageProps.initialData.searchResult
89
- .aggregatedCount — int: total matching products (e.g. 18818)
90
- .paginationV2.maxPage — int: last page number
91
- .itemStacks[] — array of stacks (usually 2: sponsored + organic)
92
- .items[] — array of product objects
93
- ```
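The path above can be walked defensively, since any level may be missing if the payload shape changes. A minimal sketch, assuming a hypothetical helper named `dig` (not part of the original skill file) and a made-up sample payload shaped like the documented path:

```python
def dig(obj, *keys, default=None):
    """Follow keys/indexes through nested dicts and lists; return default on any miss."""
    for key in keys:
        if isinstance(obj, dict):
            obj = obj.get(key)
        elif isinstance(obj, list) and isinstance(key, int) and -len(obj) <= key < len(obj):
            obj = obj[key]
        else:
            return default
        if obj is None:
            return default
    return obj


# Hypothetical payload for illustration only; real responses carry many more fields.
sample = {"props": {"pageProps": {"initialData": {"searchResult": {
    "aggregatedCount": 18818,
    "paginationV2": {"maxPage": 11},
    "itemStacks": [{"items": [{"usItemId": "123"}]}],
}}}}}

SEARCH = ("props", "pageProps", "initialData", "searchResult")
total = dig(sample, *SEARCH, "aggregatedCount")
first_id = dig(sample, *SEARCH, "itemStacks", 0, "items", 0, "usItemId")
missing = dig(sample, *SEARCH, "itemStacks", 5, "items", default=[])
```

A miss at any depth yields the default instead of a `KeyError` or `IndexError`, which keeps a scraper running when one stack or field is absent.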
94
-
95
- ### Full extractor (field-tested)
96
-
97
- ```python
98
- def extract_search_results(html):
99
-     """
100
-     Returns (items, total_count, max_page).
101
-     items is a list of dicts with confirmed fields.
102
-     """
103
-     data = parse_next_data(html)
104
-     sr = data["props"]["pageProps"]["initialData"]["searchResult"]
105
-
106
-     items = []
107
-     for stack in sr.get("itemStacks", []):
108
-         for item in stack.get("items", []):
109
-             pi = item.get("priceInfo") or {}
110
-             img = item.get("imageInfo") or {}
111
-             rating = item.get("rating") or {}
112
-             avail = item.get("availabilityStatusV2") or {}
113
-             items.append({
114
-                 "usItemId": item.get("usItemId"),  # str, Walmart item ID
115
-                 "name": item.get("name"),  # str
116
-                 "brand": item.get("brand"),  # str or None
117
-                 "price": item.get("price"),  # int, current price in USD
118
-                 "linePrice": pi.get("linePrice"),  # str "$429.00"
119
-                 "wasPrice": pi.get("wasPrice") or None,  # str "$699.00" or None
120
-                 "savings": pi.get("savings") or None,  # str "SAVE $270.00" or None
121
-                 "averageRating": rating.get("averageRating"),  # float e.g. 4.3
122
-                 "numberOfReviews": rating.get("numberOfReviews"),  # int
123
-                 "availability": avail.get("value"),  # "IN_STOCK" / "OUT_OF_STOCK"
124
-                 "isSponsored": bool(item.get("isSponsoredFlag")),
125
-                 "url": "https://www.walmart.com" + (item.get("canonicalUrl") or "").split("?")[0],
126
-                 "thumbnailUrl": img.get("thumbnailUrl"),
127
-             })
128
-
129
-     total = sr.get("aggregatedCount")
130
-     max_page = (sr.get("paginationV2") or {}).get("maxPage")
131
-     return items, total, max_page
132
-
133
-
134
- # Usage
135
- html = fetch_walmart("https://www.walmart.com/search?q=laptop")
136
- items, total, max_page = extract_search_results(html)
137
- # items: 66 items on page 1, total=18818, max_page=11
138
-
139
- # Filter out sponsored
140
- organic = [i for i in items if not i["isSponsored"]]
141
- ```
142
-
143
- ### Field notes (confirmed)
144
-
145
- - **`usItemId`**: string, matches the numeric ID at the end of `/ip/.../ITEMID` URLs.
146
- Some non-product rows (ad widgets) have `usItemId=None` — filter with `if item.get("usItemId")`.
147
- - **`price`**: numeric price in dollars, not cents (e.g. `429` for "$429.00"). Use `priceInfo.linePrice` for
148
- the formatted string including the dollar sign.
149
- - **`wasPrice` / `savings`**: only present when item is on sale. Always `None` for full-price items.
150
- - **`isSponsoredFlag`**: the first batch of results across both itemStacks are frequently sponsored.
151
- On a laptop search, ~56 of 66 SSR items carry `isSponsoredFlag: true`.
152
- - **`rating`**: present on ~91% of items (60/66 in test). `averageRating` is a float; `numberOfReviews` is int.
153
- - **`canonicalUrl`**: always includes `?classType=...&athbdg=...` query params — strip with `.split("?")[0]`
154
- to get a clean URL.
155
- - **Two itemStacks**: Walmart returns two stacks (`itemStacks[0]` and `itemStacks[1]`). Merge them.
156
- `itemStacks[0]` is the primary grid; `itemStacks[1]` is a secondary sponsored/related block.
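`linePrice`, `wasPrice`, and `savings` arrive as display strings, so comparing them numerically needs a parsing step. A minimal sketch, assuming only the formats shown in the notes above (`"$429.00"`, `"SAVE $270.00"`); `parse_money` is a hypothetical helper name, not something the skill file defines:

```python
import re
from decimal import Decimal

# First dollar amount in a string, with optional thousands separators and cents.
_MONEY = re.compile(r"\$\s*([0-9][0-9,]*(?:\.[0-9]{1,2})?)")

def parse_money(text):
    """Return the first dollar amount in text as a Decimal, or None if absent."""
    if not text:
        return None
    m = _MONEY.search(text)
    if not m:
        return None
    return Decimal(m.group(1).replace(",", ""))

line = parse_money("$429.00")
was = parse_money("$699.00")
saved = parse_money("SAVE $270.00")
```

`Decimal` is used rather than `float` so that differences such as `was - line` compare exactly against the parsed savings.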
157
-
158
- ### Pagination
159
-
160
- ```python
161
- for page in range(1, (max_page or 1) + 1):  # maxPage can be missing, so fall back to 1
162
-     html = fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={page}")
163
-     items, _, _ = extract_search_results(html)
164
-     # process items...
165
- ```
166
-
167
- Page responses average ~2.5 s each. No rate-limiting was observed across 3 sequential requests.
168
- For bulk scraping, add a 1–2 s delay between requests to be safe.
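The delay advice can be folded into the pagination loop itself. A minimal sketch, assuming a hypothetical generator named `crawl_pages` that takes the fetch and extract steps as callables; the demo uses stand-in callables so it runs without network access:

```python
import time

def crawl_pages(fetch_page, extract, delay_s=1.5, page_limit=None, sleep=time.sleep):
    """
    fetch_page(page) -> html; extract(html) -> (items, total, max_page).
    Yields (page, items) and sleeps delay_s between consecutive requests.
    """
    page, last = 1, 1
    while page <= last:
        items, _, max_page = extract(fetch_page(page))
        if page == 1:
            last = max_page or 1          # maxPage may be absent from the payload
            if page_limit is not None:
                last = min(last, page_limit)
        yield page, items
        page += 1
        if page <= last:
            sleep(delay_s)                # pause only when another request follows


# Demo with stand-ins: three fake pages, sleeps recorded instead of performed.
naps = []
fake_fetch = lambda page: f"page-{page}"
fake_extract = lambda html: ([html], 3, 3)
pages = list(crawl_pages(fake_fetch, fake_extract, sleep=naps.append))
```

With the helpers from this document, the call would look like `crawl_pages(lambda p: fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={p}"), extract_search_results)`.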
169
-
170
- ---
171
-
172
- ## Product Detail Page
173
-
174
- ### URL pattern
175
-
176
- ```
177
- https://www.walmart.com/ip/{slug}/{usItemId}
178
- ```
179
-
180
- The slug is ignored in routing — only the numeric `usItemId` matters.
181
- These work identically:
182
- ```
183
- https://www.walmart.com/ip/anything/19717318352
184
- https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352
185
- ```
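Because only the trailing numeric ID matters, product URLs can be normalized by that ID. A minimal sketch, assuming hypothetical helper names `item_id_from_url` and `product_url`; the pattern treats the slug as optional purely as a parsing tolerance, not as a claim about Walmart's routing:

```python
import re
from urllib.parse import urlsplit

def item_id_from_url(url):
    """Return the trailing numeric usItemId from a /ip/ URL, or None."""
    path = urlsplit(url).path              # drops the query string and fragment
    m = re.fullmatch(r"/ip/(?:[^/]+/)?(\d+)/?", path)
    return m.group(1) if m else None

def product_url(us_item_id, slug="item"):
    """Build a product URL; the slug is a placeholder because routing ignores it."""
    return f"https://www.walmart.com/ip/{slug}/{us_item_id}"

a = item_id_from_url("https://www.walmart.com/ip/anything/19717318352")
b = item_id_from_url("https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352?classType=VARIANT")
```

Keying a cache or a de-duplication set on the extracted ID avoids fetching the same product twice under different slugs.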
186
-
187
- ### `__NEXT_DATA__` path on a product page
188
-
189
- ```
190
- data.props.pageProps.initialData.data
191
- .product — core product object
192
- .idml — long description, specs, highlights, warranty
193
- .reviews — rating breakdown + first 10 customer reviews (SSR)
194
- ```
195
-
196
- ### Full extractor (field-tested)
197
-
198
- ```python
199
- def extract_product_detail(html):
200
-     """
201
-     Returns a dict with all confirmed product fields.
202
-     idml.specifications returns all spec rows as a flat dict.
203
-     reviews returns the SSR-rendered first 10 customer reviews.
204
-     """
205
-     data = parse_next_data(html)
206
-     d = data["props"]["pageProps"]["initialData"]["data"]
207
-     product = d["product"]
208
-     idml = d.get("idml") or {}
209
-     reviews = d.get("reviews") or {}
210
-
211
-     pi = product.get("priceInfo") or {}
212
-     cp = pi.get("currentPrice") or {}
213
-     img = product.get("imageInfo") or {}
214
-     avail = product.get("availabilityStatusV2") or {}
215
-
216
-     specs = {
217
-         spec.get("name"): spec.get("value")
218
-         for spec in (idml.get("specifications") or [])
219
-     }
220
-
221
-     all_images = [
222
-         img_item.get("url")
223
-         for img_item in (img.get("allImages") or [])
224
-         if img_item.get("url")
225
-     ]
226
-
227
-     customer_reviews = [
228
-         {
229
-             "title": r.get("reviewTitle"),
230
-             "rating": r.get("rating"),  # int 1-5 (field is "rating", NOT "overallRating")
231
-             "text": r.get("reviewText"),
232
-             "author": r.get("userNickname"),
233
-             "date": r.get("reviewSubmissionTime"),
234
-         }
235
-         for r in (reviews.get("customerReviews") or [])
236
-     ]
237
-
238
-     return {
239
-         # identity
240
-         "usItemId": product.get("usItemId"),
241
-         "name": product.get("name"),
242
-         "brand": product.get("brand"),
243
-         "model": product.get("model"),
244
-         "upc": product.get("upc"),
245
-         # price
246
-         "price": cp.get("price"),  # float, e.g. 599
247
-         "priceString": cp.get("priceString"),  # "$599.00"
248
-         "wasPrice": (pi.get("wasPrice") or {}).get("priceString"),
249
-         "savings": (pi.get("savings") or {}).get("savingsString"),
250
-         # availability
251
-         "availability": avail.get("value"),  # "IN_STOCK" / "OUT_OF_STOCK"
252
-         "availabilityDisplay": avail.get("display"),  # "In stock"
253
-         # ratings
254
-         "averageRating": product.get("averageRating"),
255
-         "numberOfReviews": product.get("numberOfReviews"),
256
-         # text
257
-         "shortDescription": product.get("shortDescription"),
258
-         "longDescription": idml.get("longDescription"),  # HTML string
259
-         # media
260
-         "thumbnailUrl": img.get("thumbnailUrl"),
261
-         "allImages": all_images,  # up to 10 image URLs
262
-         # specs
263
-         "specifications": specs,  # {"Brand": "Apple", "Processor": "A18 Pro", ...}
264
-         "highlights": [  # top highlighted specs with icons
265
-             {"name": h.get("name"), "value": h.get("value")}
266
-             for h in (idml.get("productHighlights") or [])
267
-         ],
268
-         # URL
269
-         "canonicalUrl": "https://www.walmart.com" + (product.get("canonicalUrl") or ""),
270
-         # fulfillment
271
-         "fulfillmentOptions": product.get("fulfillmentOptions") or [],
272
-         # reviews (SSR-rendered, first 10)
273
-         "reviewSummary": {
274
-             "averageOverallRating": reviews.get("averageOverallRating"),
275
-             "totalReviewCount": reviews.get("totalReviewCount"),
276
-             "reviewsWithTextCount": reviews.get("reviewsWithTextCount"),
277
-             "recommendedPercentage": reviews.get("recommendedPercentage"),
278
-         },
279
-         "customerReviews": customer_reviews,
280
-     }
281
-
282
-
283
- # Usage
284
- url = "https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352"
285
- html = fetch_walmart(url)
286
- product = extract_product_detail(html)
287
-
288
- # Example output (confirmed live):
289
- # product["name"] → "Apple MacBook Neo 13-inch Apple A18 Pro chip..."
290
- # product["price"] → 599
291
- # product["priceString"] → "$599.00"
292
- # product["availability"] → "IN_STOCK"
293
- # product["model"] → "MHFD4LL/A"
294
- # product["upc"] → "195950852745"
295
- # len(product["specifications"]) → 29 spec rows
296
- # len(product["allImages"]) → 10
297
- # product["specifications"]["Processor"] → "A18 Pro"
298
- ```
299
-
300
- ### Field notes (confirmed)
301
-
302
- - **`averageRating` / `numberOfReviews`** on the product node: present for items with reviews.
303
- New/few-review items may return `None` for both.
304
- - **`reviewSummary.averageOverallRating`** in the reviews node often differs slightly from
305
- `product.averageRating` — the reviews node is more precise (e.g. `4.75` vs `4.8`).
306
- - **`customerReviews`** (SSR): always the first 10 reviews. The per-review rating field is `"rating"`
307
- (int 1–5), **not** `"overallRating"` (which is always `None`).
308
- - **`longDescription`**: raw HTML string including `<ul>/<li>` tags. Strip tags before display.
309
- - **`specifications`**: flat dict — confirmed 29–31 rows for electronics. Key names use display labels
310
- (e.g. `"RAM memory"`, `"Screen size"`, `"HD capacity"`).
311
- - **`wasPrice` / `savings`** on detail page: same as search — `None` when item is not discounted.
312
- - **No JSON-LD**: Walmart product pages do **not** include `<script type="application/ld+json">`.
313
- All structured data lives in `__NEXT_DATA__`.
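Stripping tags from `longDescription` can be done with the standard library alone. A minimal sketch, assuming a hypothetical helper `html_to_text` and a made-up fragment; it turns `<li>` items into dash-prefixed lines and decodes character references:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes, inserting line breaks at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n- ")
        elif tag in ("p", "br", "ul"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(fragment):
    """Return plain text for an HTML fragment; None or empty input gives ''."""
    parser = _TextExtractor()
    parser.feed(fragment or "")
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    cleaned = [" ".join(line.split()) for line in raw_lines]   # collapse whitespace
    return "\n".join(line for line in cleaned if line)

text = html_to_text("<p>Thin &amp; light.</p><ul><li>13-inch display</li><li>All-day battery</li></ul>")
```

A parser is used here instead of a regex such as `<[^>]+>` so that entities like `&amp;` are decoded and list structure survives as separate lines.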
314
-
315
- ---
316
-
317
- ## Anti-Bot: PerimeterX
318
-
319
- Walmart uses **PerimeterX** (app ID `PXu6b0qd2S`, confirmed in `runtimeConfig.perimeterX`).
320
-
321
- | Signal | Detail |
322
- |---|---|
323
- | Bot detector | PerimeterX |
324
- | Challenge page | "Robot or human?" — 200 OK, 15 KB HTML |
325
- | Triggered by | Full Chrome UA strings and scripted-client UAs (curl, python-requests) |
326
- | Bypassed by | `User-Agent: Mozilla/5.0` (bare prefix only) |
327
- | No JS execution | SSR response is complete — no JS challenge to solve |
328
-
329
- Detection in code:
330
- ```python
331
- if "Robot or human" in html:
332
- raise RuntimeError("PerimeterX challenge — switch to browser harness")
333
- ```
334
-
335
- If `http_get` starts returning the challenge after a run of successful fetches, switch to the
336
- browser harness (see below).
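The switch-over can be made sticky, so that one challenge routes all later requests to the browser path. A minimal sketch, assuming a hypothetical wrapper `fetch_with_fallback`; `http_fetch` and `browser_fetch` are caller-supplied callables (stand-ins below), and `RuntimeError` is the challenge signal raised by the fetch helper in this document:

```python
def fetch_with_fallback(url, http_fetch, browser_fetch, state=None):
    """
    Try http_fetch(url). Once it raises RuntimeError, record that in state and
    send this call and all later calls sharing the state to browser_fetch(url).
    Returns (html, source) where source is "http" or "browser".
    """
    state = state if state is not None else {}
    if not state.get("blocked"):
        try:
            return http_fetch(url), "http"
        except RuntimeError:
            state["blocked"] = True
    return browser_fetch(url), "browser"


# Demo with stand-ins: the fake HTTP fetcher is challenged on its second call.
calls = {"n": 0}

def fake_http(url):
    calls["n"] += 1
    if calls["n"] >= 2:
        raise RuntimeError("PerimeterX challenge")
    return "<html>ok</html>"

def fake_browser(url):
    return "<html>from browser</html>"

state = {}
sources = []
for _ in range(3):
    _, source = fetch_with_fallback("https://example.invalid/", fake_http, fake_browser, state)
    sources.append(source)
```

Sharing one `state` dict across calls avoids re-triggering the challenge on every request after the first block.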
337
-
338
- ---
339
-
340
- ## Browser Harness Fallback
341
-
342
- Use the browser harness when:
343
- - PerimeterX starts blocking `http_get` on your IP
344
- - You need to interact with the page (add to cart, filter UI, infinite scroll)
345
- - You need variant switching (color/size selectors)
346
-
347
- ```python
348
- # Browser-based search extraction
349
- new_tab("https://www.walmart.com/search?q=laptop")
350
- wait_for_load()
351
- wait(2) # JS renders product cards after readyState=complete
352
-
353
- # Extract via __NEXT_DATA__ in-browser (identical structure to http_get)
354
- import json
355
- nd = js("document.getElementById('__NEXT_DATA__')?.textContent")
356
- data = json.loads(nd)
357
- sr = data["props"]["pageProps"]["initialData"]["searchResult"]
358
- items = []
359
- for stack in sr.get("itemStacks", []):
360
-     items.extend(stack.get("items", []))
361
- ```
362
-
363
- ### Browser selectors (confirmed working for DOM-based extraction)
364
-
365
- ```python
366
- # Product cards on search results page
367
- results = js("""
368
- Array.from(document.querySelectorAll('[data-item-id]')).map(el => ({
369
- itemId: el.getAttribute('data-item-id'),
370
- name: el.querySelector('[itemprop="name"]')?.innerText?.trim(),
371
- price: el.querySelector('[itemprop="price"]')?.getAttribute('content'),
372
- url: el.querySelector('a[link-identifier]')?.href,
373
- })).filter(r => r.itemId)
374
- """)
375
-
376
- # If [data-item-id] misses items, use the Next.js data attribute alternative:
377
- results_alt = js("""
378
- Array.from(document.querySelectorAll('[data-testid="list-view"]'))
379
- .map(el => el.innerText.trim())
380
- """)
381
- ```
382
-
383
- > **Prefer `__NEXT_DATA__` over DOM selectors** even in-browser — the JSON is complete and
384
- > stable. DOM class names at Walmart are obfuscated and change between deployments.
385
-
386
- ### Session gotcha
387
-
388
- Always open Walmart with `new_tab()` on first visit:
389
- ```python
390
- new_tab("https://www.walmart.com/search?q=laptop")
391
- wait_for_load()
392
- wait(2)
393
- ```
394
- After that, `goto_url()` works normally within the same session.
395
-
396
- ---
397
-
398
- ## Public API
399
-
400
- Walmart's affiliate/partner API (`developer.api.walmart.com`) requires a registered API key
401
- and returns HTTP 403 without one. No unauthenticated public product API is available.
402
- The `__NEXT_DATA__` SSR approach replaces any need for the official API for read-only data.
403
-
404
- ---
405
-
406
- ## Gotchas
407
-
408
- - **UA must be `Mozilla/5.0` bare**: Any fuller string (Chrome, Safari, curl, requests) hits
409
- PerimeterX. This is counterintuitive — the *shorter*, less realistic UA is the one that works.
410
-
411
- - **Regex must use `id=` attribute match**: The regex
412
- `r'<script id="__NEXT_DATA__" type="application/json">...'` fails because the actual tag is
413
- `<script id="__NEXT_DATA__">` without `type`. Use:
414
- ```python
415
- re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
416
- ```
417
-
418
- - **`usItemId` can be `None`**: ~5/66 items on a page are non-product ad widgets with no `usItemId`.
419
- Always filter: `[i for i in items if i.get("usItemId")]`.
420
-
421
- - **Two `itemStacks`**: Walmart returns two stacks. Iterate over all stacks or you'll miss
422
- ~10 items from the second stack.
423
-
424
- - **`canonicalUrl` includes tracking params**: Always strip with `.split("?")[0]`.
425
-
426
- - **Review field is `"rating"` not `"overallRating"`**: Each `customerReviews` entry has a `"rating"`
427
- int field (1–5). The `"overallRating"` field is always `None`. Don't confuse with
428
- `product.averageRating` (the aggregate float).
429
-
430
- - **No JSON-LD on product pages**: Zero `<script type="application/ld+json">` tags were found.
431
- All structured data is in `__NEXT_DATA__`.
432
-
433
- - **`longDescription` is HTML**: Strip tags before text use. May contain promotional/financing copy
434
- mixed with real product description.
435
-
436
- - **Page sizes vary**: Page 1 returned 66 items across 2 stacks; page 2 returned 55.
437
- Do not assume a fixed items-per-page count.
438
-
439
- - **`http_get` default already sends `Mozilla/5.0`**: `helpers.http_get()` uses
440
- `"User-Agent": "Mozilla/5.0"` by default — no override needed when calling it directly.
441
- Only pass a custom `headers=` if you need to change something else.
442
-
443
- - **`developer.api.walmart.com`** returns HTTP 403 without an API key. Not usable for
444
- unauthenticated scraping.
1
+ # Walmart — Product Search & Data Extraction
2
+
3
+ Field-tested against walmart.com on 2026-04-18 using `http_get` (no browser required).
4
+ All code blocks were run and outputs verified against live responses.
5
+
6
+ ---
7
+
8
+ ## Fastest Approach: `http_get` with `__NEXT_DATA__`
9
+
10
+ Walmart's Next.js SSR embeds the full search or product payload as JSON in a
11
+ `<script id="__NEXT_DATA__">` tag. **No browser needed for search or product detail pages.**
12
+ ~2–3 s per page fetch; no CAPTCHA or session cookies required.
13
+
14
+ ### Critical UA rule
15
+
16
+ | User-Agent | Result |
17
+ |---|---|
18
+ | `Mozilla/5.0` (bare) | Full HTML + `__NEXT_DATA__` — **use this** |
19
+ | `Mozilla/5.0 ... Chrome/120 ...` (full) | PerimeterX "Robot or human?" challenge (200, 15 KB) |
20
+ | `Safari/17` full UA | Works (full HTML, ~1.15 MB) |
21
+ | `curl/7.x` | PerimeterX challenge |
22
+ | `python-requests/2.31` | PerimeterX challenge |
23
+
24
+ The bare `Mozilla/5.0` string bypasses PerimeterX. In these tests, UAs that identify an HTTP
25
+ client (curl, python-requests) or a full Chrome build triggered the challenge page; the full Safari UA did not.
26
+
27
+ ### Base fetch helper
28
+
29
+ ```python
30
+ import json, re, gzip, urllib.request
31
+
32
+ def fetch_walmart(url):
33
+     """
34
+     Fetch any walmart.com page.
35
+     Returns decoded HTML string.
36
+     Raises RuntimeError if PerimeterX bot challenge is returned.
37
+     """
38
+     req = urllib.request.Request(
39
+         url,
40
+         headers={"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"},
41
+     )
42
+     with urllib.request.urlopen(req, timeout=20) as r:
43
+         data = r.read()
44
+         if r.headers.get("Content-Encoding") == "gzip":
45
+             data = gzip.decompress(data)
46
+     html = data.decode()
47
+     if "Robot or human" in html:
48
+         raise RuntimeError(f"PerimeterX challenge triggered: {url}")
49
+     return html
50
+
51
+ def parse_next_data(html):
52
+     m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
53
+     if not m:
54
+         raise ValueError("__NEXT_DATA__ not found — page structure may have changed")
55
+     return json.loads(m.group(1))
56
+ ```
57
+
58
+ ---
59
+
60
+ ## Search Results
61
+
62
+ ### URL patterns
63
+
64
+ ```python
65
+ # Keyword search
66
+ "https://www.walmart.com/search?q=laptop"
67
+
68
+ # Pagination — append &page=N
69
+ "https://www.walmart.com/search?q=laptop&page=2"
70
+
71
+ # Sort options (confirmed working)
72
+ "https://www.walmart.com/search?q=laptop&sort=best_match" # default
73
+ "https://www.walmart.com/search?q=laptop&sort=best_seller"
74
+ "https://www.walmart.com/search?q=laptop&sort=price_low"
75
+ "https://www.walmart.com/search?q=laptop&sort=customer_rating"
76
+
77
+ # Price filter
78
+ "https://www.walmart.com/search?q=laptop&min_price=200&max_price=500"
79
+
80
+ # Browse by category (department ID path)
81
+ "https://www.walmart.com/browse/electronics/laptops/3944_1089430_3951"
82
+ ```
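The URL patterns above can also be assembled programmatically. This is a minimal sketch, assuming only the query parameters listed above (`q`, `page`, `sort`, `min_price`, `max_price`); the helper name `build_search_url` is hypothetical and not part of the package.

```python
from urllib.parse import urlencode

def build_search_url(query, page=None, sort=None, min_price=None, max_price=None):
    """Assemble a walmart.com search URL (hypothetical helper, not in the package)."""
    params = {"q": query}
    if page and page > 1:          # page 1 is the default, so omit it
        params["page"] = page
    if sort:                       # e.g. "price_low", "best_seller"
        params["sort"] = sort
    if min_price is not None:
        params["min_price"] = min_price
    if max_price is not None:
        params["max_price"] = max_price
    # urlencode escapes spaces and special characters in the query
    return "https://www.walmart.com/search?" + urlencode(params)
```

Building the URL through `urlencode` avoids hand-escaping multi-word queries such as `gaming laptop`.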
83
+
84
+ ### `__NEXT_DATA__` path to items
85
+
86
+ ```
87
+ data
88
+ .props.pageProps.initialData.searchResult
89
+ .aggregatedCount — int: total matching products (e.g. 18818)
90
+ .paginationV2.maxPage — int: last page number
91
+ .itemStacks[] — array of stacks (usually 2: sponsored + organic)
92
+ .items[] — array of product objects
93
+ ```
94
+
95
+ ### Full extractor (field-tested)
96
+
97
+ ```python
98
+ def extract_search_results(html):
99
+     """
100
+     Returns (items, total_count, max_page).
101
+     items is a list of dicts with confirmed fields.
102
+     """
103
+     data = parse_next_data(html)
104
+     sr = data["props"]["pageProps"]["initialData"]["searchResult"]
105
+
106
+     items = []
107
+     for stack in sr.get("itemStacks", []):
108
+         for item in stack.get("items", []):
109
+             pi = item.get("priceInfo") or {}
110
+             img = item.get("imageInfo") or {}
111
+             rating = item.get("rating") or {}
112
+             avail = item.get("availabilityStatusV2") or {}
113
+             items.append({
114
+                 "usItemId": item.get("usItemId"),  # str, Walmart item ID
115
+                 "name": item.get("name"),  # str
116
+                 "brand": item.get("brand"),  # str or None
117
+                 "price": item.get("price"),  # int, current price in USD
118
+                 "linePrice": pi.get("linePrice"),  # str "$429.00"
119
+                 "wasPrice": pi.get("wasPrice") or None,  # str "$699.00" or None
120
+                 "savings": pi.get("savings") or None,  # str "SAVE $270.00" or None
121
+                 "averageRating": rating.get("averageRating"),  # float e.g. 4.3
122
+                 "numberOfReviews": rating.get("numberOfReviews"),  # int
123
+                 "availability": avail.get("value"),  # "IN_STOCK" / "OUT_OF_STOCK"
124
+                 "isSponsored": bool(item.get("isSponsoredFlag")),
125
+                 "url": "https://www.walmart.com" + (item.get("canonicalUrl") or "").split("?")[0],
126
+                 "thumbnailUrl": img.get("thumbnailUrl"),
127
+             })
128
+
129
+     total = sr.get("aggregatedCount")
130
+     max_page = (sr.get("paginationV2") or {}).get("maxPage")
131
+     return items, total, max_page
132
+
133
+
134
+ # Usage
135
+ html = fetch_walmart("https://www.walmart.com/search?q=laptop")
136
+ items, total, max_page = extract_search_results(html)
137
+ # items: 66 items on page 1, total=18818, max_page=11
138
+
139
+ # Filter out sponsored
140
+ organic = [i for i in items if not i["isSponsored"]]
141
+ ```
142
+
143
+ ### Field notes (confirmed)
144
+
145
+ - **`usItemId`**: string, matches the numeric ID at the end of `/ip/.../ITEMID` URLs.
146
+ Some non-product rows (ad widgets) have `usItemId=None` — filter with `if item.get("usItemId")`.
147
+ - **`price`**: unformatted numeric price with no currency symbol (e.g. `429` for "$429.00"). Use `priceInfo.linePrice` for
148
+ the formatted string including the dollar sign.
149
+ - **`wasPrice` / `savings`**: only present when item is on sale. Always `None` for full-price items.
150
+ - **`isSponsoredFlag`**: the first batch of results across both itemStacks is frequently sponsored.
151
+ On a laptop search, ~56 of 66 SSR items carry `isSponsoredFlag: true`.
152
+ - **`rating`**: present on ~91% of items (60/66 in test). `averageRating` is a float; `numberOfReviews` is int.
153
+ - **`canonicalUrl`**: always includes `?classType=...&athbdg=...` query params — strip with `.split("?")[0]`
154
+ to get a clean URL.
155
+ - **Two itemStacks**: Walmart returns two stacks (`itemStacks[0]` and `itemStacks[1]`). Merge them.
156
+ `itemStacks[0]` is the primary grid; `itemStacks[1]` is a secondary sponsored/related block.
157
+
158
+ ### Pagination
159
+
160
+ ```python
161
+ for page in range(1, (max_page or 1) + 1):  # max_page is None when paginationV2 is absent
162
+     html = fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={page}")
163
+     items, _, _ = extract_search_results(html)
164
+     # process items...
165
+ ```
166
+
167
+ Page responses average ~2.5 s each. No rate-limiting was observed across 3 sequential requests.
168
+ For bulk scraping, add a 1–2 s delay between requests to be safe.
169
+
170
+ ---
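The pagination advice above (iterate to `maxPage`, pause between requests, drop rows without `usItemId`) can be combined into one loop. This is a rough sketch: `collect_pages` and its `fetch_page` callable are hypothetical names, and deduplicating by `usItemId` across pages is a precaution assumed here, not a behavior the notes above report.

```python
import time

def collect_pages(fetch_page, max_page, delay=1.5, sleep=time.sleep):
    """
    fetch_page(page) -> list of item dicts (hypothetical callable, e.g. a wrapper
    around fetch_walmart + extract_search_results). Returns de-duplicated items.
    """
    seen, collected = set(), []
    for page in range(1, (max_page or 1) + 1):   # treat a missing max_page as 1
        if page > 1:
            sleep(delay)                         # polite gap between requests
        for item in fetch_page(page):
            item_id = item.get("usItemId")
            if not item_id or item_id in seen:   # skip ad widgets and repeats
                continue
            seen.add(item_id)
            collected.append(item)
    return collected
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in normal use the default `time.sleep` applies.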
171
+
172
+ ## Product Detail Page
173
+
174
+ ### URL pattern
175
+
176
+ ```
177
+ https://www.walmart.com/ip/{slug}/{usItemId}
178
+ ```
179
+
180
+ The slug is ignored in routing — only the numeric `usItemId` matters.
181
+ These work identically:
182
+ ```
183
+ https://www.walmart.com/ip/anything/19717318352
184
+ https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352
185
+ ```
186
+
187
+ ### `__NEXT_DATA__` path on a product page
188
+
189
+ ```
190
+ data.props.pageProps.initialData.data
191
+ .product — core product object
192
+ .idml — long description, specs, highlights, warranty
193
+ .reviews — rating breakdown + first 10 customer reviews (SSR)
194
+ ```
195
+
196
+ ### Full extractor (field-tested)
197
+
198
+ ```python
199
+ def extract_product_detail(html):
200
+ """
201
+ Returns a dict with all confirmed product fields.
202
+ idml.specifications returns all spec rows as a flat dict.
203
+ reviews returns the SSR-rendered first 10 customer reviews.
204
+ """
205
+ data = parse_next_data(html)
206
+ d = data["props"]["pageProps"]["initialData"]["data"]
207
+ product = d["product"]
208
+ idml = d.get("idml") or {}
209
+ reviews = d.get("reviews") or {}
210
+
211
+ pi = product.get("priceInfo") or {}
212
+ cp = pi.get("currentPrice") or {}
213
+ img = product.get("imageInfo") or {}
214
+ avail = product.get("availabilityStatusV2") or {}
215
+
216
+ specs = {
217
+ spec.get("name"): spec.get("value")
218
+ for spec in (idml.get("specifications") or [])
219
+ }
220
+
221
+ all_images = [
222
+ img_item.get("url")
223
+ for img_item in (img.get("allImages") or [])
224
+ if img_item.get("url")
225
+ ]
226
+
227
+ customer_reviews = [
228
+ {
229
+ "title": r.get("reviewTitle"),
230
+ "rating": r.get("rating"), # int 1-5 (field is "rating", NOT "overallRating")
231
+ "text": r.get("reviewText"),
232
+ "author": r.get("userNickname"),
233
+ "date": r.get("reviewSubmissionTime"),
234
+ }
235
+ for r in (reviews.get("customerReviews") or [])
236
+ ]
237
+
238
+ return {
239
+ # identity
240
+ "usItemId": product.get("usItemId"),
241
+ "name": product.get("name"),
242
+ "brand": product.get("brand"),
243
+ "model": product.get("model"),
244
+ "upc": product.get("upc"),
245
+ # price
246
+ "price": cp.get("price"), # float, e.g. 599
247
+ "priceString": cp.get("priceString"), # "$599.00"
248
+ "wasPrice": (pi.get("wasPrice") or {}).get("priceString"),
249
+ "savings": (pi.get("savings") or {}).get("savingsString"),
250
+ # availability
251
+ "availability": avail.get("value"), # "IN_STOCK" / "OUT_OF_STOCK"
252
+ "availabilityDisplay": avail.get("display"), # "In stock"
253
+ # ratings
254
+ "averageRating": product.get("averageRating"),
255
+ "numberOfReviews": product.get("numberOfReviews"),
256
+ # text
257
+ "shortDescription": product.get("shortDescription"),
258
+ "longDescription": idml.get("longDescription"), # HTML string
259
+ # media
260
+ "thumbnailUrl": img.get("thumbnailUrl"),
261
+ "allImages": all_images, # up to 10 image URLs
262
+ # specs
263
+ "specifications": specs, # {"Brand": "Apple", "Processor": "A18 Pro", ...}
264
+ "highlights": [ # top highlighted specs with icons
265
+ {"name": h.get("name"), "value": h.get("value")}
266
+ for h in (idml.get("productHighlights") or [])
267
+ ],
268
+ # URL
269
+ "canonicalUrl": "https://www.walmart.com" + (product.get("canonicalUrl") or "").split("?")[0],
270
+ # fulfillment
271
+ "fulfillmentOptions": product.get("fulfillmentOptions") or [],
272
+ # reviews (SSR-rendered, first 10)
273
+ "reviewSummary": {
274
+ "averageOverallRating": reviews.get("averageOverallRating"),
275
+ "totalReviewCount": reviews.get("totalReviewCount"),
276
+ "reviewsWithTextCount": reviews.get("reviewsWithTextCount"),
277
+ "recommendedPercentage": reviews.get("recommendedPercentage"),
278
+ },
279
+ "customerReviews": customer_reviews,
280
+ }
281
+
282
+
283
+ # Usage
284
+ url = "https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352"
285
+ html = fetch_walmart(url)
286
+ product = extract_product_detail(html)
287
+
288
+ # Example output (confirmed live):
289
+ # product["name"] → "Apple MacBook Neo 13-inch Apple A18 Pro chip..."
290
+ # product["price"] → 599
291
+ # product["priceString"] → "$599.00"
292
+ # product["availability"] → "IN_STOCK"
293
+ # product["model"] → "MHFD4LL/A"
294
+ # product["upc"] → "195950852745"
295
+ # len(product["specifications"]) → 29 spec rows
296
+ # len(product["allImages"]) → 10
297
+ # product["specifications"]["Processor"] → "A18 Pro"
298
+ ```
299
+
300
+ ### Field notes (confirmed)
301
+
302
+ - **`averageRating` / `numberOfReviews`** on the product node: present for items with reviews.
303
+ New/few-review items may return `None` for both.
304
+ - **`reviewSummary.averageOverallRating`** in the reviews node often differs slightly from
305
+ `product.averageRating` — the reviews node is more precise (e.g. `4.75` vs `4.8`).
306
+ - **`customerReviews`** (SSR): always the first 10 reviews. The per-review rating field is `"rating"`
307
+ (int 1–5), **not** `"overallRating"` (which is always `None`).
308
+ - **`longDescription`**: raw HTML string including `<ul>/<li>` tags. Strip tags before display.
309
+ - **`specifications`**: flat dict — confirmed 29–31 rows for electronics. Key names use display labels
310
+ (e.g. `"RAM memory"`, `"Screen size"`, `"HD capacity"`).
311
+ - **`wasPrice` / `savings`** on detail page: same as search — `None` when item is not discounted.
312
+ - **No JSON-LD**: Walmart product pages do **not** include `<script type="application/ld+json">`.
313
+ All structured data lives in `__NEXT_DATA__`.
314
+
315
+ ---
316
+
317
+ ## Anti-Bot: PerimeterX
318
+
319
+ Walmart uses **PerimeterX** (app ID `PXu6b0qd2S`, confirmed in `runtimeConfig.perimeterX`).
320
+
321
+ | Signal | Detail |
322
+ |---|---|
323
+ | Bot detector | PerimeterX |
324
+ | Challenge page | "Robot or human?" — 200 OK, 15 KB HTML |
325
+ | Triggered by | Full Chrome UA string and HTTP-client UAs (curl, python-requests) |
326
+ | Bypassed by | `User-Agent: Mozilla/5.0` (bare prefix only) |
327
+ | No JS execution | SSR response is complete — no JS challenge to solve |
328
+
329
+ Detection in code:
330
+ ```python
331
+ if "Robot or human" in html:
332
+     raise RuntimeError("PerimeterX challenge — switch to browser harness")
333
+ ```
334
+
335
+ If `http_get` starts returning the challenge after a run of successful fetches, switch to the
336
+ browser harness (see below).
337
+
338
+ ---
339
+
340
+ ## Browser Harness Fallback
341
+
342
+ Use the browser harness when:
343
+ - PerimeterX starts blocking `http_get` on your IP
344
+ - You need to interact with the page (add to cart, filter UI, infinite scroll)
345
+ - You need variant switching (color/size selectors)
346
+
347
+ ```python
348
+ # Browser-based search extraction
349
+ new_tab("https://www.walmart.com/search?q=laptop")
350
+ wait_for_load()
351
+ wait(2) # JS renders product cards after readyState=complete
352
+
353
+ # Extract via __NEXT_DATA__ in-browser (identical structure to http_get)
354
+ import json
355
+ nd = js("document.getElementById('__NEXT_DATA__')?.textContent")
356
+ data = json.loads(nd)
357
+ sr = data["props"]["pageProps"]["initialData"]["searchResult"]
358
+ items = []
359
+ for stack in sr.get("itemStacks", []):
360
+     items.extend(stack.get("items", []))
361
+ ```
362
+
363
+ ### Browser selectors (confirmed working for DOM-based extraction)
364
+
365
+ ```python
366
+ # Product cards on search results page
367
+ results = js("""
368
+ Array.from(document.querySelectorAll('[data-item-id]')).map(el => ({
369
+ itemId: el.getAttribute('data-item-id'),
370
+ name: el.querySelector('[itemprop="name"]')?.innerText?.trim(),
371
+ price: el.querySelector('[itemprop="price"]')?.getAttribute('content'),
372
+ url: el.querySelector('a[link-identifier]')?.href,
373
+ })).filter(r => r.itemId)
374
+ """)
375
+
376
+ # If [data-item-id] misses items, use the Next.js data attribute alternative:
377
+ results_alt = js("""
378
+ Array.from(document.querySelectorAll('[data-testid="list-view"]'))
379
+ .map(el => el.innerText.trim())
380
+ """)
381
+ ```
382
+
383
+ > **Prefer `__NEXT_DATA__` over DOM selectors** even in-browser — the JSON is complete and
384
+ > stable. DOM class names at Walmart are obfuscated and change between deployments.
385
+
386
+ ### Session gotcha
387
+
388
+ Always open Walmart with `new_tab()` on first visit:
389
+ ```python
390
+ new_tab("https://www.walmart.com/search?q=laptop")
391
+ wait_for_load()
392
+ wait(2)
393
+ ```
394
+ After that, `goto_url()` works normally within the same session.
395
+
396
+ ---
397
+
398
+ ## Public API
399
+
400
+ Walmart's affiliate/partner API (`developer.api.walmart.com`) requires a registered API key
401
+ and returns HTTP 403 without one. No unauthenticated public product API is available.
402
+ The `__NEXT_DATA__` SSR approach replaces any need for the official API for read-only data.
403
+
404
+ ---
405
+
406
+ ## Gotchas
407
+
408
+ - **UA must be `Mozilla/5.0` bare**: Fuller strings (Chrome, curl, python-requests) hit
409
+ PerimeterX; per the UA table, the full Safari UA was the only fuller string that worked. This is counterintuitive: the *shorter*, less realistic UA is the reliable one.
410
+
411
+ - **Regex must use `id=` attribute match**: The regex
412
+ `r'<script id="__NEXT_DATA__" type="application/json">...'` fails because the actual tag is
413
+ `<script id="__NEXT_DATA__">` without `type`. Use:
414
+ ```python
415
+ re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
416
+ ```
417
+
418
+ - **`usItemId` can be `None`**: ~5/66 items on a page are non-product ad widgets with no `usItemId`.
419
+ Always filter: `[i for i in items if i.get("usItemId")]`.
420
+
421
+ - **Two `itemStacks`**: Walmart returns two stacks. Iterate over all stacks or you'll miss
422
+ ~10 items from the second stack.
423
+
424
+ - **`canonicalUrl` includes tracking params**: Always strip with `.split("?")[0]`.
425
+
426
+ - **Review field is `"rating"` not `"overallRating"`**: Each `customerReviews` entry has a `"rating"`
427
+ int field (1–5). The `"overallRating"` field is always `None`. Don't confuse with
428
+ `product.averageRating` (the aggregate float).
429
+
430
+ - **No JSON-LD on product pages**: Zero `<script type="application/ld+json">` tags were found.
431
+ All structured data is in `__NEXT_DATA__`.
432
+
433
+ - **`longDescription` is HTML**: Strip tags before text use. May contain promotional/financing copy
434
+ mixed with real product description.
435
+
436
+ - **Page sizes vary**: Page 1 returned 66 items across 2 stacks; page 2 returned 55.
437
+ Do not assume a fixed items-per-page count.
438
+
439
+ - **`http_get` default already sends `Mozilla/5.0`**: `helpers.http_get()` uses
440
+ `"User-Agent": "Mozilla/5.0"` by default — no override needed when calling it directly.
441
+ Only pass a custom `headers=` if you need to change something else.
442
+
443
+ - **`developer.api.walmart.com`** returns HTTP 403 without an API key. Not usable for
444
+ unauthenticated scraping.
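
The `longDescription` gotcha above says to strip tags before using the text. One way to do that with only the standard library is sketched below; `strip_html` is a hypothetical helper, not something the package ships, and it makes no attempt to separate promotional copy from the real description.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    if not fragment:
        return ""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace
    return " ".join(" ".join(parser.parts).split())
```

For example, `strip_html(product["longDescription"])` would turn the `<ul>/<li>` markup into a single line of text.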