@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,444 +1,444 @@
- # Walmart — Product Search & Data Extraction
-
- Field-tested against walmart.com on 2026-04-18 using `http_get` (no browser required).
- All code blocks were run and outputs verified against live responses.
-
- ---
-
- ## Fastest Approach: `http_get` with `__NEXT_DATA__`
-
- Walmart's Next.js SSR embeds the full search or product payload as JSON in a
- `<script id="__NEXT_DATA__">` tag. **No browser needed for search or product detail pages.**
- ~2–3 s per page fetch; no CAPTCHA or session cookies required.
-
- ### Critical UA rule
-
- | User-Agent | Result |
- |---|---|
- | `Mozilla/5.0` (bare) | Full HTML + `__NEXT_DATA__` — **use this** |
- | `Mozilla/5.0 ... Chrome/120 ...` (full) | PerimeterX "Robot or human?" challenge (200, 15 KB) |
- | `Safari/17` full UA | Works (full HTML, ~1.15 MB) |
- | `curl/7.x` | PerimeterX challenge |
- | `python-requests/2.31` | PerimeterX challenge |
-
- The bare `Mozilla/5.0` string bypasses PerimeterX. In the tests above, HTTP-client UAs
- (curl, python-requests) and a full Chrome UA triggered the challenge page; a full Safari/17 UA did not.
-
- ### Base fetch helper
-
- ```python
- import json, re, gzip, urllib.request
-
- def fetch_walmart(url):
-     """
-     Fetch any walmart.com page.
-     Returns decoded HTML string.
-     Raises RuntimeError if PerimeterX bot challenge is returned.
-     """
-     req = urllib.request.Request(
-         url,
-         headers={"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"},
-     )
-     with urllib.request.urlopen(req, timeout=20) as r:
-         data = r.read()
-         if r.headers.get("Content-Encoding") == "gzip":
-             data = gzip.decompress(data)
-     html = data.decode()
-     if "Robot or human" in html:
-         raise RuntimeError(f"PerimeterX challenge triggered: {url}")
-     return html
-
- def parse_next_data(html):
-     m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
-     if not m:
-         raise ValueError("__NEXT_DATA__ not found — page structure may have changed")
-     return json.loads(m.group(1))
- ```
-
- ---
-
- ## Search Results
-
- ### URL patterns
-
- ```python
- # Keyword search
- "https://www.walmart.com/search?q=laptop"
-
- # Pagination — append &page=N
- "https://www.walmart.com/search?q=laptop&page=2"
-
- # Sort options (confirmed working)
- "https://www.walmart.com/search?q=laptop&sort=best_match"  # default
- "https://www.walmart.com/search?q=laptop&sort=best_seller"
- "https://www.walmart.com/search?q=laptop&sort=price_low"
- "https://www.walmart.com/search?q=laptop&sort=customer_rating"
-
- # Price filter
- "https://www.walmart.com/search?q=laptop&min_price=200&max_price=500"
-
- # Browse by category (department ID path)
- "https://www.walmart.com/browse/electronics/laptops/3944_1089430_3951"
- ```
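
The query parameters in the patterns above compose freely, so a small builder can replace hand-concatenated strings. The following is a minimal sketch that assumes only the parameter names shown above (`q`, `page`, `sort`, `min_price`, `max_price`); `build_search_url` is a hypothetical helper and not part of the package.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.walmart.com/search"

def build_search_url(query, page=None, sort=None, min_price=None, max_price=None):
    """Hypothetical helper: assemble a search URL from the parameters listed above."""
    params = {"q": query}
    # Only include optional parameters that were actually supplied.
    for key, value in (("page", page), ("sort", sort),
                       ("min_price", min_price), ("max_price", max_price)):
        if value is not None:
            params[key] = value
    # urlencode handles escaping (spaces become "+") and keeps insertion order.
    return f"{SEARCH_BASE}?{urlencode(params)}"
```

Using `urlencode` rather than string formatting means multi-word queries are escaped consistently.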
-
- ### `__NEXT_DATA__` path to items
-
- ```
- data
-   .props.pageProps.initialData.searchResult
-     .aggregatedCount          — int: total matching products (e.g. 18818)
-     .paginationV2.maxPage     — int: last page number
-     .itemStacks[]             — array of stacks (usually 2: sponsored + organic)
-       .items[]                — array of product objects
- ```
-
- ### Full extractor (field-tested)
-
- ```python
- def extract_search_results(html):
-     """
-     Returns (items, total_count, max_page).
-     items is a list of dicts with confirmed fields.
-     """
-     data = parse_next_data(html)
-     sr = data["props"]["pageProps"]["initialData"]["searchResult"]
-
-     items = []
-     for stack in sr.get("itemStacks", []):
-         for item in stack.get("items", []):
-             pi = item.get("priceInfo") or {}
-             img = item.get("imageInfo") or {}
-             rating = item.get("rating") or {}
-             avail = item.get("availabilityStatusV2") or {}
-             items.append({
-                 "usItemId": item.get("usItemId"),  # str, Walmart item ID
-                 "name": item.get("name"),  # str
-                 "brand": item.get("brand"),  # str or None
-                 "price": item.get("price"),  # int, current price in USD
-                 "linePrice": pi.get("linePrice"),  # str "$429.00"
-                 "wasPrice": pi.get("wasPrice") or None,  # str "$699.00" or None
-                 "savings": pi.get("savings") or None,  # str "SAVE $270.00" or None
-                 "averageRating": rating.get("averageRating"),  # float e.g. 4.3
-                 "numberOfReviews": rating.get("numberOfReviews"),  # int
-                 "availability": avail.get("value"),  # "IN_STOCK" / "OUT_OF_STOCK"
-                 "isSponsored": bool(item.get("isSponsoredFlag")),
-                 "url": "https://www.walmart.com" + (item.get("canonicalUrl") or "").split("?")[0],
-                 "thumbnailUrl": img.get("thumbnailUrl"),
-             })
-
-     total = sr.get("aggregatedCount")
-     max_page = (sr.get("paginationV2") or {}).get("maxPage")
-     return items, total, max_page
-
-
- # Usage
- html = fetch_walmart("https://www.walmart.com/search?q=laptop")
- items, total, max_page = extract_search_results(html)
- # items: 66 items on page 1, total=18818, max_page=11
-
- # Filter out sponsored
- organic = [i for i in items if not i["isSponsored"]]
- ```
-
- ### Field notes (confirmed)
-
- - **`usItemId`**: string, matches the numeric ID at the end of `/ip/.../ITEMID` URLs.
-   Some non-product rows (ad widgets) have `usItemId=None` — filter with `if item.get("usItemId")`.
- - **`price`**: bare numeric price with no currency formatting (e.g. `429` for "$429.00"). Use `priceInfo.linePrice` for
-   the formatted string including the dollar sign.
- - **`wasPrice` / `savings`**: only present when the item is on sale. Always `None` for full-price items.
- - **`isSponsoredFlag`**: the first batch of results across both itemStacks is frequently sponsored.
-   On a laptop search, ~56 of 66 SSR items carry `isSponsoredFlag: true`.
- - **`rating`**: present on ~91% of items (60/66 in test). `averageRating` is a float; `numberOfReviews` is int.
- - **`canonicalUrl`**: always includes `?classType=...&athbdg=...` query params — strip with `.split("?")[0]`
-   to get a clean URL.
- - **Two itemStacks**: Walmart returns two stacks (`itemStacks[0]` and `itemStacks[1]`). Merge them.
-   `itemStacks[0]` is the primary grid; `itemStacks[1]` is a secondary sponsored/related block.
-
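
The field notes above suggest a post-processing pass over the merged stacks: drop rows without `usItemId`, optionally drop sponsored rows, and turn the formatted `linePrice` string into a number. The following is a sketch under those assumptions. `clean_items` and `parse_price_string` are hypothetical names, and the de-duplication by `usItemId` is a precaution added here, not something the notes confirm is necessary.

```python
from decimal import Decimal, InvalidOperation

def parse_price_string(text):
    """Convert a formatted price such as "$1,299.00" to Decimal; None if unparseable."""
    if not text:
        return None
    try:
        return Decimal(text.replace("$", "").replace(",", "").strip())
    except InvalidOperation:
        return None

def clean_items(items, include_sponsored=False):
    """Filter the dicts produced by extract_search_results (hypothetical helper)."""
    seen = set()
    cleaned = []
    for item in items:
        item_id = item.get("usItemId")
        # Ad widgets have no usItemId; also skip IDs already emitted (precaution).
        if not item_id or item_id in seen:
            continue
        if item.get("isSponsored") and not include_sponsored:
            continue
        seen.add(item_id)
        cleaned.append({**item, "linePriceValue": parse_price_string(item.get("linePrice"))})
    return cleaned
```

`Decimal` is used instead of `float` so that prices compare and sum without rounding surprises.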
- ### Pagination
-
- ```python
- for page in range(1, max_page + 1):
-     html = fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={page}")
-     items, _, _ = extract_search_results(html)
-     # process items...
- ```
-
- Page responses average ~2.5 s each. No rate-limiting was observed across 3 sequential requests.
- For bulk scraping, add a 1–2 s delay between requests to be safe.
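
The delay advice above can be folded into the loop itself. The sketch below is one possible shape, not the package's implementation: `crawl_search` is a hypothetical generator, and `fetch` and `extract` are injected callables assumed to follow the same contracts as `fetch_walmart` and `extract_search_results`. It also guards against a missing `maxPage`, which the simple loop does not.

```python
import time
from urllib.parse import quote_plus

def crawl_search(query, fetch, extract, delay_s=1.5, page_limit=None):
    """
    Yield (page_number, items) for each results page (hypothetical helper).
    fetch(url) -> html; extract(html) -> (items, total, max_page).
    """
    page, last_page = 1, 1
    while page <= last_page:
        html = fetch(f"https://www.walmart.com/search?q={quote_plus(query)}&page={page}")
        items, _total, max_page = extract(html)
        if page == 1:
            # maxPage may be absent; fall back to a single page.
            last_page = max_page or 1
            if page_limit is not None:
                last_page = min(last_page, page_limit)
        yield page, items
        page += 1
        if page <= last_page:
            time.sleep(delay_s)  # pause between requests, per the 1-2 s advice
```

Injecting `fetch` and `extract` keeps the loop testable offline and lets the browser harness be swapped in later without touching the paging logic.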
-
- ---
-
- ## Product Detail Page
-
- ### URL pattern
-
- ```
- https://www.walmart.com/ip/{slug}/{usItemId}
- ```
-
- The slug is ignored in routing — only the numeric `usItemId` matters.
- These work identically:
- ```
- https://www.walmart.com/ip/anything/19717318352
- https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352
- ```
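
Because only the trailing ID matters, product URLs can be normalised to their `usItemId` for de-duplication or caching. This is a small sketch assuming the `/ip/{slug}/{usItemId}` shape shown above; `item_id_from_url` and `product_url` are hypothetical helpers, and the placeholder slug `item` is arbitrary.

```python
import re
from urllib.parse import urlparse

# Matches /ip/{slug}/{usItemId} with an optional trailing slash.
_ITEM_PATH = re.compile(r"^/ip/[^/]+/(\d+)/?$")

def item_id_from_url(url):
    """Return the numeric usItemId as a string, or None for non-product URLs."""
    m = _ITEM_PATH.match(urlparse(url).path)  # .path excludes the query string
    return m.group(1) if m else None

def product_url(us_item_id, slug="item"):
    """Build a product URL; the slug is a placeholder since routing ignores it."""
    return f"https://www.walmart.com/ip/{slug}/{us_item_id}"
```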
-
- ### `__NEXT_DATA__` path on a product page
-
- ```
- data.props.pageProps.initialData.data
-   .product     — core product object
-   .idml        — long description, specs, highlights, warranty
-   .reviews     — rating breakdown + first 10 customer reviews (SSR)
- ```
-
- ### Full extractor (field-tested)
-
- ```python
- def extract_product_detail(html):
-     """
-     Returns a dict with all confirmed product fields.
-     idml.specifications returns all spec rows as a flat dict.
-     reviews returns the SSR-rendered first 10 customer reviews.
-     """
-     data = parse_next_data(html)
-     d = data["props"]["pageProps"]["initialData"]["data"]
-     product = d["product"]
-     idml = d.get("idml") or {}
-     reviews = d.get("reviews") or {}
-
-     pi = product.get("priceInfo") or {}
-     cp = pi.get("currentPrice") or {}
-     img = product.get("imageInfo") or {}
-     avail = product.get("availabilityStatusV2") or {}
-
-     specs = {
-         spec.get("name"): spec.get("value")
-         for spec in (idml.get("specifications") or [])
-     }
-
-     all_images = [
-         img_item.get("url")
-         for img_item in (img.get("allImages") or [])
-         if img_item.get("url")
-     ]
-
-     customer_reviews = [
-         {
-             "title": r.get("reviewTitle"),
-             "rating": r.get("rating"),  # int 1-5 (field is "rating", NOT "overallRating")
-             "text": r.get("reviewText"),
-             "author": r.get("userNickname"),
-             "date": r.get("reviewSubmissionTime"),
-         }
-         for r in (reviews.get("customerReviews") or [])
-     ]
-
-     return {
-         # identity
-         "usItemId": product.get("usItemId"),
-         "name": product.get("name"),
-         "brand": product.get("brand"),
-         "model": product.get("model"),
-         "upc": product.get("upc"),
-         # price
-         "price": cp.get("price"),  # float, e.g. 599
-         "priceString": cp.get("priceString"),  # "$599.00"
-         "wasPrice": (pi.get("wasPrice") or {}).get("priceString"),
-         "savings": (pi.get("savings") or {}).get("savingsString"),
-         # availability
-         "availability": avail.get("value"),  # "IN_STOCK" / "OUT_OF_STOCK"
-         "availabilityDisplay": avail.get("display"),  # "In stock"
-         # ratings
-         "averageRating": product.get("averageRating"),
-         "numberOfReviews": product.get("numberOfReviews"),
-         # text
-         "shortDescription": product.get("shortDescription"),
-         "longDescription": idml.get("longDescription"),  # HTML string
-         # media
-         "thumbnailUrl": img.get("thumbnailUrl"),
-         "allImages": all_images,  # up to 10 image URLs
-         # specs
-         "specifications": specs,  # {"Brand": "Apple", "Processor": "A18 Pro", ...}
-         "highlights": [  # top highlighted specs with icons
-             {"name": h.get("name"), "value": h.get("value")}
-             for h in (idml.get("productHighlights") or [])
-         ],
-         # URL
-         "canonicalUrl": "https://www.walmart.com" + (product.get("canonicalUrl") or ""),
-         # fulfillment
-         "fulfillmentOptions": product.get("fulfillmentOptions") or [],
-         # reviews (SSR-rendered, first 10)
-         "reviewSummary": {
-             "averageOverallRating": reviews.get("averageOverallRating"),
-             "totalReviewCount": reviews.get("totalReviewCount"),
-             "reviewsWithTextCount": reviews.get("reviewsWithTextCount"),
-             "recommendedPercentage": reviews.get("recommendedPercentage"),
-         },
-         "customerReviews": customer_reviews,
-     }
-
-
- # Usage
- url = "https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352"
- html = fetch_walmart(url)
- product = extract_product_detail(html)
-
- # Example output (confirmed live):
- # product["name"] → "Apple MacBook Neo 13-inch Apple A18 Pro chip..."
- # product["price"] → 599
- # product["priceString"] → "$599.00"
- # product["availability"] → "IN_STOCK"
- # product["model"] → "MHFD4LL/A"
- # product["upc"] → "195950852745"
- # len(product["specifications"]) → 29 spec rows
- # len(product["allImages"]) → 10
- # product["specifications"]["Processor"] → "A18 Pro"
- ```
-
- ### Field notes (confirmed)
-
- - **`averageRating` / `numberOfReviews`** on the product node: present for items with reviews.
-   New/few-review items may return `None` for both.
- - **`reviewSummary.averageOverallRating`** in the reviews node often differs slightly from
-   `product.averageRating` — the reviews node is more precise (e.g. `4.75` vs `4.8`).
- - **`customerReviews`** (SSR): always the first 10 reviews. The per-review rating field is `"rating"`
-   (int 1–5), **not** `"overallRating"` (which is always `None`).
- - **`longDescription`**: raw HTML string including `<ul>/<li>` tags. Strip tags before display.
- - **`specifications`**: flat dict — confirmed 29–31 rows for electronics. Key names use display labels
-   (e.g. `"RAM memory"`, `"Screen size"`, `"HD capacity"`).
- - **`wasPrice` / `savings`** on detail page: same as search — `None` when item is not discounted.
- - **No JSON-LD**: Walmart product pages do **not** include `<script type="application/ld+json">`.
-   All structured data lives in `__NEXT_DATA__`.
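
The note on `longDescription` above says to strip tags before display. One way to do that with only the standard library is sketched below; `description_to_text` is a hypothetical helper, and the set of tags treated as line breaks is an assumption based on the `<ul>/<li>` markup mentioned in the notes.

```python
from html.parser import HTMLParser

_BLOCK_TAGS = {"li", "p", "br", "ul", "ol", "div"}

class _TextExtractor(HTMLParser):
    """Collect text content, starting a new line at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.lines = []
        self._buf = []

    def flush(self):
        # Collapse runs of whitespace and keep only non-empty lines.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)

def description_to_text(html_fragment):
    """Return plain text, one line per list item or paragraph."""
    parser = _TextExtractor()
    parser.feed(html_fragment or "")
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)
```

A parser is used rather than a regex so that entities are decoded and inline tags such as `<b>` do not split a line.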
-
- ---
-
- ## Anti-Bot: PerimeterX
-
- Walmart uses **PerimeterX** (app ID `PXu6b0qd2S`, confirmed in `runtimeConfig.perimeterX`).
-
- | Signal | Detail |
- |---|---|
- | Bot detector | PerimeterX |
- | Challenge page | "Robot or human?" — 200 OK, 15 KB HTML |
- | Triggered by | Full Chrome UA strings and HTTP-client UAs (curl, python-requests) |
- | Bypassed by | `User-Agent: Mozilla/5.0` (bare prefix only) |
- | No JS execution | SSR response is complete — no JS challenge to solve |
-
- Detection in code:
- ```python
- if "Robot or human" in html:
-     raise RuntimeError("PerimeterX challenge — switch to browser harness")
- ```
-
- If `http_get` starts returning the challenge after a run of successful fetches, switch to the
- browser harness (see below).
-
- ---
-
- ## Browser Harness Fallback
-
- Use the browser harness when:
- - PerimeterX starts blocking `http_get` on your IP
- - You need to interact with the page (add to cart, filter UI, infinite scroll)
- - You need variant switching (color/size selectors)
-
- ```python
- # Browser-based search extraction
- new_tab("https://www.walmart.com/search?q=laptop")
- wait_for_load()
- wait(2)  # JS renders product cards after readyState=complete
-
- # Extract via __NEXT_DATA__ in-browser (identical structure to http_get)
- import json
- nd = js("document.getElementById('__NEXT_DATA__')?.textContent")
- data = json.loads(nd)
- sr = data["props"]["pageProps"]["initialData"]["searchResult"]
- items = []
- for stack in sr.get("itemStacks", []):
-     items.extend(stack.get("items", []))
- ```
-
- ### Browser selectors (confirmed working for DOM-based extraction)
-
- ```python
- # Product cards on search results page
- results = js("""
-   Array.from(document.querySelectorAll('[data-item-id]')).map(el => ({
-     itemId: el.getAttribute('data-item-id'),
-     name: el.querySelector('[itemprop="name"]')?.innerText?.trim(),
-     price: el.querySelector('[itemprop="price"]')?.getAttribute('content'),
-     url: el.querySelector('a[link-identifier]')?.href,
-   })).filter(r => r.itemId)
- """)
-
- # If [data-item-id] misses items, use the Next.js data attribute alternative:
- results_alt = js("""
-   Array.from(document.querySelectorAll('[data-testid="list-view"]'))
-     .map(el => el.innerText.trim())
- """)
- ```
-
- > **Prefer `__NEXT_DATA__` over DOM selectors** even in-browser — the JSON is complete and
- > stable. DOM class names at Walmart are obfuscated and change between deployments.
-
- ### Session gotcha
-
- Always open Walmart with `new_tab()` on first visit:
- ```python
- new_tab("https://www.walmart.com/search?q=laptop")
- wait_for_load()
- wait(2)
- ```
- After that, `goto_url()` works normally within the same session.
-
396
- ---
397
-
398
- ## Public API
399
-
400
- Walmart's affiliate/partner API (`developer.api.walmart.com`) requires a registered API key
401
- and returns HTTP 403 without one. No unauthenticated public product API is available.
402
- The `__NEXT_DATA__` SSR approach replaces any need for the official API for read-only data.
403
-
404
- ---
405
-
406
- ## Gotchas
407
-
408
- - **UA must be `Mozilla/5.0` bare**: Any fuller string (Chrome, Safari, curl, requests) hits
409
- PerimeterX. This is counterintuitive — the *shorter*, less realistic UA is the one that works.
410
-
411
- - **Regex must use `id=` attribute match**: The regex
412
- `r'<script id="__NEXT_DATA__" type="application/json">...'` fails because the actual tag is
413
- `<script id="__NEXT_DATA__">` without `type`. Use:
414
- ```python
415
- re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
416
- ```
417
-
418
- - **`usItemId` can be `None`**: ~5/66 items on a page are non-product ad widgets with no `usItemId`.
419
- Always filter: `[i for i in items if i.get("usItemId")]`.
420
-
421
- - **Two `itemStacks`**: Walmart returns two stacks. Iterate over all stacks or you'll miss
422
- ~10 items from the second stack.
423
-
424
- - **`canonicalUrl` includes tracking params**: Always strip with `.split("?")[0]`.
425
-
426
- - **Review field is `"rating"` not `"overallRating"`**: Each `customerReviews` entry has a `"rating"`
427
- int field (1–5). The `"overallRating"` field is always `None`. Don't confuse with
428
- `product.averageRating` (the aggregate float).
429
-
430
- - **No JSON-LD on product pages**: Zero `<script type="application/ld+json">` tags were found.
431
- All structured data is in `__NEXT_DATA__`.
432
-
433
- - **`longDescription` is HTML**: Strip tags before text use. May contain promotional/financing copy
434
- mixed with real product description.
435
-
436
- - **Page sizes vary**: Page 1 returned 66 items across 2 stacks; page 2 returned 55.
437
- Do not assume a fixed items-per-page count.
438
-
439
- - **`http_get` default already sends `Mozilla/5.0`**: `helpers.http_get()` uses
440
- `"User-Agent": "Mozilla/5.0"` by default — no override needed when calling it directly.
441
- Only pass a custom `headers=` if you need to change something else.
442
-
443
- - **`developer.api.walmart.com`** returns HTTP 403 without an API key. Not usable for
444
- unauthenticated scraping.
1
+ # Walmart — Product Search & Data Extraction
2
+
3
+ Field-tested against walmart.com on 2026-04-18 using `http_get` (no browser required).
4
+ All code blocks were run and outputs verified against live responses.
5
+
6
+ ---
7
+
8
+ ## Fastest Approach: `http_get` with `__NEXT_DATA__`
9
+
10
+ Walmart's Next.js SSR embeds the full search or product payload as JSON in a
11
+ `<script id="__NEXT_DATA__">` tag. **No browser needed for search or product detail pages.**
12
+ ~2–3 s per page fetch; no CAPTCHA or session cookies required.
13
+
14
+ ### Critical UA rule
15
+
16
+ | User-Agent | Result |
17
+ |---|---|
18
+ | `Mozilla/5.0` (bare) | Full HTML + `__NEXT_DATA__` — **use this** |
19
+ | `Mozilla/5.0 ... Chrome/120 ...` (full) | PerimeterX "Robot or human?" challenge (200, 15 KB) |
20
+ | `Safari/17` full UA | Works (full HTML, ~1.15 MB) |
21
+ | `curl/7.x` | PerimeterX challenge |
22
+ | `python-requests/2.31` | PerimeterX challenge |
23
+
24
+ The bare `Mozilla/5.0` string bypasses PerimeterX. In these tests, a full Chrome UA and
25
+ HTTP-client UAs (curl, python-requests) returned the challenge page; a full Safari 17 UA did not.
26
+
27
+ ### Base fetch helper
28
+
29
+ ```python
30
+ import json, re, gzip, urllib.request
31
+
32
+ def fetch_walmart(url):
33
+     """
34
+     Fetch any walmart.com page.
35
+     Returns decoded HTML string.
36
+     Raises RuntimeError if PerimeterX bot challenge is returned.
37
+     """
38
+     req = urllib.request.Request(
39
+         url,
40
+         headers={"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"},
41
+     )
42
+     with urllib.request.urlopen(req, timeout=20) as r:
43
+         data = r.read()
44
+         if r.headers.get("Content-Encoding") == "gzip":
45
+             data = gzip.decompress(data)
46
+     html = data.decode()
47
+     if "Robot or human" in html:
48
+         raise RuntimeError(f"PerimeterX challenge triggered: {url}")
49
+     return html
50
+
51
+ def parse_next_data(html):
52
+     m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
53
+     if not m:
54
+         raise ValueError("__NEXT_DATA__ not found — page structure may have changed")
55
+     return json.loads(m.group(1))
56
+ ```
57
+
58
+ ---
59
+
60
+ ## Search Results
61
+
62
+ ### URL patterns
63
+
64
+ ```python
65
+ # Keyword search
66
+ "https://www.walmart.com/search?q=laptop"
67
+
68
+ # Pagination — append &page=N
69
+ "https://www.walmart.com/search?q=laptop&page=2"
70
+
71
+ # Sort options (confirmed working)
72
+ "https://www.walmart.com/search?q=laptop&sort=best_match" # default
73
+ "https://www.walmart.com/search?q=laptop&sort=best_seller"
74
+ "https://www.walmart.com/search?q=laptop&sort=price_low"
75
+ "https://www.walmart.com/search?q=laptop&sort=customer_rating"
76
+
77
+ # Price filter
78
+ "https://www.walmart.com/search?q=laptop&min_price=200&max_price=500"
79
+
80
+ # Browse by category (department ID path)
81
+ "https://www.walmart.com/browse/electronics/laptops/3944_1089430_3951"
82
+ ```
83
+
84
+ ### `__NEXT_DATA__` path to items
85
+
86
+ ```
87
+ data
88
+ .props.pageProps.initialData.searchResult
89
+ .aggregatedCount — int: total matching products (e.g. 18818)
90
+ .paginationV2.maxPage — int: last page number
91
+ .itemStacks[] — array of stacks (usually 2: sponsored + organic)
92
+ .items[] — array of product objects
93
+ ```
94
+
95
+ ### Full extractor (field-tested)
96
+
97
+ ```python
98
+ def extract_search_results(html):
99
+     """
100
+     Returns (items, total_count, max_page).
101
+     items is a list of dicts with confirmed fields.
102
+     """
103
+     data = parse_next_data(html)
104
+     sr = data["props"]["pageProps"]["initialData"]["searchResult"]
105
+
106
+     items = []
107
+     for stack in sr.get("itemStacks", []):
108
+         for item in stack.get("items", []):
109
+             pi = item.get("priceInfo") or {}
110
+             img = item.get("imageInfo") or {}
111
+             rating = item.get("rating") or {}
112
+             avail = item.get("availabilityStatusV2") or {}
113
+             items.append({
114
+                 "usItemId": item.get("usItemId"), # str, Walmart item ID
115
+                 "name": item.get("name"), # str
116
+                 "brand": item.get("brand"), # str or None
117
+                 "price": item.get("price"), # int, current price in USD
118
+                 "linePrice": pi.get("linePrice"), # str "$429.00"
119
+                 "wasPrice": pi.get("wasPrice") or None, # str "$699.00" or None
120
+                 "savings": pi.get("savings") or None, # str "SAVE $270.00" or None
121
+                 "averageRating": rating.get("averageRating"), # float e.g. 4.3
122
+                 "numberOfReviews": rating.get("numberOfReviews"), # int
123
+                 "availability": avail.get("value"), # "IN_STOCK" / "OUT_OF_STOCK"
124
+                 "isSponsored": bool(item.get("isSponsoredFlag")),
125
+                 "url": "https://www.walmart.com" + (item.get("canonicalUrl") or "").split("?")[0],
126
+                 "thumbnailUrl": img.get("thumbnailUrl"),
127
+             })
128
+
129
+     total = sr.get("aggregatedCount")
130
+     max_page = (sr.get("paginationV2") or {}).get("maxPage")
131
+     return items, total, max_page
132
+
133
+
134
+ # Usage
135
+ html = fetch_walmart("https://www.walmart.com/search?q=laptop")
136
+ items, total, max_page = extract_search_results(html)
137
+ # items: 66 items on page 1, total=18818, max_page=11
138
+
139
+ # Filter out sponsored
140
+ organic = [i for i in items if not i["isSponsored"]]
141
+ ```
142
+
143
+ ### Field notes (confirmed)
144
+
145
+ - **`usItemId`**: string, matches the numeric ID at the end of `/ip/.../ITEMID` URLs.
146
+ Some non-product rows (ad widgets) have `usItemId=None` — filter with `if item.get("usItemId")`.
147
+ - **`price`**: integer dollar amount with no currency symbol (e.g. `429` for "$429.00"). Use `priceInfo.linePrice` for
148
+ the formatted string including the dollar sign.
149
+ - **`wasPrice` / `savings`**: only present when item is on sale. Always `None` for full-price items.
150
+ - **`isSponsoredFlag`**: the first batch of results across both itemStacks is frequently sponsored.
151
+ On a laptop search, ~56 of 66 SSR items carry `isSponsoredFlag: true`.
152
+ - **`rating`**: present on ~91% of items (60/66 in test). `averageRating` is a float; `numberOfReviews` is int.
153
+ - **`canonicalUrl`**: always includes `?classType=...&athbdg=...` query params — strip with `.split("?")[0]`
154
+ to get a clean URL.
155
+ - **Two itemStacks**: Walmart returns two stacks (`itemStacks[0]` and `itemStacks[1]`). Merge them.
156
+ `itemStacks[0]` is the primary grid; `itemStacks[1]` is a secondary sponsored/related block.
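
The field notes above describe `linePrice`, `wasPrice`, and `savings` as formatted strings such as `"$429.00"` or `"SAVE $270.00"`. For arithmetic on those values, a minimal sketch of a parser follows. The helper name `parse_usd` is hypothetical, and the handling of thousands separators (e.g. `"$1,299.99"`) is an assumption that was not confirmed against live responses.

```python
import re
from decimal import Decimal
from typing import Optional

# First dollar amount in the string: "$429.00", "SAVE $270.00", "$1,299.99" (assumed format)
_AMOUNT = re.compile(r"\$\s*([0-9][0-9,]*(?:\.[0-9]{1,2})?)")

def parse_usd(text: Optional[str]) -> Optional[Decimal]:
    """Return the first dollar amount found in text as a Decimal, or None."""
    if not text:
        return None
    m = _AMOUNT.search(text)
    if not m:
        return None
    # Drop thousands separators before converting
    return Decimal(m.group(1).replace(",", ""))
```

`Decimal` is used instead of `float` so that amounts like `"$0.10"` keep their exact value when summed or compared.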
157
+
158
+ ### Pagination
159
+
160
+ ```python
161
+ for page in range(1, (max_page or 1) + 1):  # max_page is None if paginationV2 is absent
162
+     html = fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={page}")
163
+     items, _, _ = extract_search_results(html)
164
+     # process items...
165
+ ```
166
+
167
+ Page responses average ~2.5 s each. No rate-limiting was observed across 3 sequential requests.
168
+ For bulk scraping, add a 1–2 s delay between requests to be safe.
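
The delay advice above could be sketched as a small generator. This is an illustration under stated assumptions, not part of the tested guide: `iter_search_pages` and its `fetch_page` callback are hypothetical names, and `fetch_page(page)` is assumed to return the same `(items, total, max_page)` tuple as `extract_search_results`.

```python
import time
from typing import Callable, Iterator, List, Tuple

def iter_search_pages(
    fetch_page: Callable[[int], Tuple[List[dict], object, object]],
    delay_s: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, List[dict]]]:
    """Yield (page_number, items), pausing delay_s seconds between requests."""
    page, max_page = 1, None
    while True:
        items, _total, reported_max = fetch_page(page)
        if reported_max:
            max_page = reported_max
        yield page, items
        # Stop at the last reported page, or after page 1 if maxPage was never reported
        if not max_page or page >= max_page:
            break
        page += 1
        sleep(delay_s)
```

A caller would pass something like `lambda p: extract_search_results(fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={p}"))` as `fetch_page`. The `sleep` parameter exists only so the pause can be replaced in tests.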
169
+
170
+ ---
171
+
172
+ ## Product Detail Page
173
+
174
+ ### URL pattern
175
+
176
+ ```
177
+ https://www.walmart.com/ip/{slug}/{usItemId}
178
+ ```
179
+
180
+ The slug is ignored in routing — only the numeric `usItemId` matters.
181
+ These work identically:
182
+ ```
183
+ https://www.walmart.com/ip/anything/19717318352
184
+ https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352
185
+ ```
186
+
187
+ ### `__NEXT_DATA__` path on a product page
188
+
189
+ ```
190
+ data.props.pageProps.initialData.data
191
+ .product — core product object
192
+ .idml — long description, specs, highlights, warranty
193
+ .reviews — rating breakdown + first 10 customer reviews (SSR)
194
+ ```
195
+
196
+ ### Full extractor (field-tested)
197
+
198
+ ```python
199
+ def extract_product_detail(html):
200
+     """
201
+     Returns a dict with all confirmed product fields.
202
+     idml.specifications returns all spec rows as a flat dict.
203
+     reviews returns the SSR-rendered first 10 customer reviews.
204
+     """
205
+     data = parse_next_data(html)
206
+     d = data["props"]["pageProps"]["initialData"]["data"]
207
+     product = d["product"]
208
+     idml = d.get("idml") or {}
209
+     reviews = d.get("reviews") or {}
210
+
211
+     pi = product.get("priceInfo") or {}
212
+     cp = pi.get("currentPrice") or {}
213
+     img = product.get("imageInfo") or {}
214
+     avail = product.get("availabilityStatusV2") or {}
215
+
216
+     specs = {
217
+         spec.get("name"): spec.get("value")
218
+         for spec in (idml.get("specifications") or [])
219
+     }
220
+
221
+     all_images = [
222
+         img_item.get("url")
223
+         for img_item in (img.get("allImages") or [])
224
+         if img_item.get("url")
225
+     ]
226
+
227
+     customer_reviews = [
228
+         {
229
+             "title": r.get("reviewTitle"),
230
+             "rating": r.get("rating"), # int 1-5 (field is "rating", NOT "overallRating")
231
+             "text": r.get("reviewText"),
232
+             "author": r.get("userNickname"),
233
+             "date": r.get("reviewSubmissionTime"),
234
+         }
235
+         for r in (reviews.get("customerReviews") or [])
236
+     ]
237
+
238
+     return {
239
+         # identity
240
+         "usItemId": product.get("usItemId"),
241
+         "name": product.get("name"),
242
+         "brand": product.get("brand"),
243
+         "model": product.get("model"),
244
+         "upc": product.get("upc"),
245
+         # price
246
+         "price": cp.get("price"), # float, e.g. 599
247
+         "priceString": cp.get("priceString"), # "$599.00"
248
+         "wasPrice": (pi.get("wasPrice") or {}).get("priceString"),
249
+         "savings": (pi.get("savings") or {}).get("savingsString"),
250
+         # availability
251
+         "availability": avail.get("value"), # "IN_STOCK" / "OUT_OF_STOCK"
252
+         "availabilityDisplay": avail.get("display"), # "In stock"
253
+         # ratings
254
+         "averageRating": product.get("averageRating"),
255
+         "numberOfReviews": product.get("numberOfReviews"),
256
+         # text
257
+         "shortDescription": product.get("shortDescription"),
258
+         "longDescription": idml.get("longDescription"), # HTML string
259
+         # media
260
+         "thumbnailUrl": img.get("thumbnailUrl"),
261
+         "allImages": all_images, # up to 10 image URLs
262
+         # specs
263
+         "specifications": specs, # {"Brand": "Apple", "Processor": "A18 Pro", ...}
264
+         "highlights": [ # top highlighted specs with icons
265
+             {"name": h.get("name"), "value": h.get("value")}
266
+             for h in (idml.get("productHighlights") or [])
267
+         ],
268
+         # URL (query params stripped, as in the search extractor)
269
+         "canonicalUrl": "https://www.walmart.com" + (product.get("canonicalUrl") or "").split("?")[0],
270
+         # fulfillment
271
+         "fulfillmentOptions": product.get("fulfillmentOptions") or [],
272
+         # reviews (SSR-rendered, first 10)
273
+         "reviewSummary": {
274
+             "averageOverallRating": reviews.get("averageOverallRating"),
275
+             "totalReviewCount": reviews.get("totalReviewCount"),
276
+             "reviewsWithTextCount": reviews.get("reviewsWithTextCount"),
277
+             "recommendedPercentage": reviews.get("recommendedPercentage"),
278
+         },
279
+         "customerReviews": customer_reviews,
280
+     }
281
+
282
+
283
+ # Usage
284
+ url = "https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352"
285
+ html = fetch_walmart(url)
286
+ product = extract_product_detail(html)
287
+
288
+ # Example output (confirmed live):
289
+ # product["name"] → "Apple MacBook Neo 13-inch Apple A18 Pro chip..."
290
+ # product["price"] → 599
291
+ # product["priceString"] → "$599.00"
292
+ # product["availability"] → "IN_STOCK"
293
+ # product["model"] → "MHFD4LL/A"
294
+ # product["upc"] → "195950852745"
295
+ # len(product["specifications"]) → 29 spec rows
296
+ # len(product["allImages"]) → 10
297
+ # product["specifications"]["Processor"] → "A18 Pro"
298
+ ```
299
+
300
+ ### Field notes (confirmed)
301
+
302
+ - **`averageRating` / `numberOfReviews`** on the product node: present for items with reviews.
303
+ New/few-review items may return `None` for both.
304
+ - **`reviewSummary.averageOverallRating`** in the reviews node often differs slightly from
305
+ `product.averageRating` — the reviews node is more precise (e.g. `4.75` vs `4.8`).
306
+ - **`customerReviews`** (SSR): always the first 10 reviews. The per-review rating field is `"rating"`
307
+ (int 1–5), **not** `"overallRating"` (which is always `None`).
308
+ - **`longDescription`**: raw HTML string including `<ul>/<li>` tags. Strip tags before display.
309
+ - **`specifications`**: flat dict — confirmed 29–31 rows for electronics. Key names use display labels
310
+ (e.g. `"RAM memory"`, `"Screen size"`, `"HD capacity"`).
311
+ - **`wasPrice` / `savings`** on detail page: same as search — `None` when item is not discounted.
312
+ - **No JSON-LD**: Walmart product pages do **not** include `<script type="application/ld+json">`.
313
+ All structured data lives in `__NEXT_DATA__`.
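
The note on `longDescription` says to strip tags before display. One possible way to do that with only the standard library is sketched below. The names `html_to_text` and `_TextExtractor` are hypothetical, and the set of tags treated as line breaks is an assumption based on the `<ul>/<li>` markup mentioned above.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    # Tags assumed to separate lines in a product description
    BLOCK_TAGS = {"p", "br", "li", "ul", "ol", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML fragment, one non-empty line per block."""
    if not html:
        return ""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Typical use would be `html_to_text(product["longDescription"])`. A parser is used rather than a regex so that entities are decoded and list items end up on separate lines.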
314
+
315
+ ---
316
+
317
+ ## Anti-Bot: PerimeterX
318
+
319
+ Walmart uses **PerimeterX** (app ID `PXu6b0qd2S`, confirmed in `runtimeConfig.perimeterX`).
320
+
321
+ | Signal | Detail |
322
+ |---|---|
323
+ | Bot detector | PerimeterX |
324
+ | Challenge page | "Robot or human?" — 200 OK, 15 KB HTML |
325
+ | Triggered by | Full Chrome UA strings and HTTP-client UAs (curl, python-requests) |
326
+ | Bypassed by | `User-Agent: Mozilla/5.0` (bare prefix only) |
327
+ | No JS execution | SSR response is complete — no JS challenge to solve |
328
+
329
+ Detection in code:
330
+ ```python
331
+ if "Robot or human" in html:
332
+     raise RuntimeError("PerimeterX challenge — switch to browser harness")
333
+ ```
334
+
335
+ If `http_get` starts returning the challenge after a run of successful fetches, switch to the
336
+ browser harness (see below).
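
The switch described above could be wrapped in a small helper. This is a sketch under stated assumptions: `fetch_with_fallback` is a hypothetical name, `http_fetch` is assumed to raise `RuntimeError` with "PerimeterX" in the message (as `fetch_walmart` does), and `browser_fetch` stands for any caller-supplied function that returns page HTML through the browser harness.

```python
from typing import Callable, Tuple

def fetch_with_fallback(
    url: str,
    http_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (html, source), where source is "http" or "browser"."""
    try:
        return http_fetch(url), "http"
    except RuntimeError as exc:
        # Only fall back on the PerimeterX challenge; re-raise anything else
        if "PerimeterX" not in str(exc):
            raise
        return browser_fetch(url), "browser"
```

Returning the source alongside the HTML lets a caller log how often the fallback fires and stop using `http_get` once it becomes the norm.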
337
+
338
+ ---
339
+
340
+ ## Browser Harness Fallback
341
+
342
+ Use the browser harness when:
343
+ - PerimeterX starts blocking `http_get` on your IP
344
+ - You need to interact with the page (add to cart, filter UI, infinite scroll)
345
+ - You need variant switching (color/size selectors)
346
+
347
+ ```python
348
+ # Browser-based search extraction
349
+ new_tab("https://www.walmart.com/search?q=laptop")
350
+ wait_for_load()
351
+ wait(2) # JS renders product cards after readyState=complete
352
+
353
+ # Extract via __NEXT_DATA__ in-browser (identical structure to http_get)
354
+ import json
355
+ nd = js("document.getElementById('__NEXT_DATA__')?.textContent")
356
+ data = json.loads(nd)
357
+ sr = data["props"]["pageProps"]["initialData"]["searchResult"]
358
+ items = []
359
+ for stack in sr.get("itemStacks", []):
360
+     items.extend(stack.get("items", []))
361
+ ```
362
+
363
+ ### Browser selectors (confirmed working for DOM-based extraction)
364
+
365
+ ```python
366
+ # Product cards on search results page
367
+ results = js("""
368
+ Array.from(document.querySelectorAll('[data-item-id]')).map(el => ({
369
+ itemId: el.getAttribute('data-item-id'),
370
+ name: el.querySelector('[itemprop="name"]')?.innerText?.trim(),
371
+ price: el.querySelector('[itemprop="price"]')?.getAttribute('content'),
372
+ url: el.querySelector('a[link-identifier]')?.href,
373
+ })).filter(r => r.itemId)
374
+ """)
375
+
376
+ # If [data-item-id] misses items, use the Next.js data attribute alternative:
377
+ results_alt = js("""
378
+ Array.from(document.querySelectorAll('[data-testid="list-view"]'))
379
+ .map(el => el.innerText.trim())
380
+ """)
381
+ ```
382
+
383
+ > **Prefer `__NEXT_DATA__` over DOM selectors** even in-browser — the JSON is complete and
384
+ > stable. DOM class names at Walmart are obfuscated and change between deployments.
385
+
386
+ ### Session gotcha
387
+
388
+ Always open Walmart with `new_tab()` on first visit:
389
+ ```python
390
+ new_tab("https://www.walmart.com/search?q=laptop")
391
+ wait_for_load()
392
+ wait(2)
393
+ ```
394
+ After that, `goto_url()` works normally within the same session.
395
+
396
+ ---
397
+
398
+ ## Public API
399
+
400
+ Walmart's affiliate/partner API (`developer.api.walmart.com`) requires a registered API key
401
+ and returns HTTP 403 without one. No unauthenticated public product API is available.
402
+ The `__NEXT_DATA__` SSR approach replaces any need for the official API for read-only data.
403
+
404
+ ---
405
+
406
+ ## Gotchas
407
+
408
+ - **Use the bare `Mozilla/5.0` UA**: A full Chrome UA and HTTP-client UAs (curl, python-requests) hit
409
+ PerimeterX; a full Safari 17 UA passed in testing. Counterintuitively, the *shorter*, less realistic UA is the reliable one.
410
+
411
+ - **Regex must use `id=` attribute match**: The regex
412
+ `r'<script id="__NEXT_DATA__" type="application/json">...'` fails because the actual tag is
413
+ `<script id="__NEXT_DATA__">` without `type`. Use:
414
+ ```python
415
+ re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
416
+ ```
417
+
418
+ - **`usItemId` can be `None`**: ~5/66 items on a page are non-product ad widgets with no `usItemId`.
419
+ Always filter: `[i for i in items if i.get("usItemId")]`.
420
+
421
+ - **Two `itemStacks`**: Walmart returns two stacks. Iterate over all stacks or you'll miss
422
+ ~10 items from the second stack.
423
+
424
+ - **`canonicalUrl` includes tracking params**: Always strip with `.split("?")[0]`.
425
+
426
+ - **Review field is `"rating"` not `"overallRating"`**: Each `customerReviews` entry has a `"rating"`
427
+ int field (1–5). The `"overallRating"` field is always `None`. Don't confuse with
428
+ `product.averageRating` (the aggregate float).
429
+
430
+ - **No JSON-LD on product pages**: Zero `<script type="application/ld+json">` tags were found.
431
+ All structured data is in `__NEXT_DATA__`.
432
+
433
+ - **`longDescription` is HTML**: Strip tags before text use. May contain promotional/financing copy
434
+ mixed with real product description.
435
+
436
+ - **Page sizes vary**: Page 1 returned 66 items across 2 stacks; page 2 returned 55.
437
+ Do not assume a fixed items-per-page count.
438
+
439
+ - **`http_get` default already sends `Mozilla/5.0`**: `helpers.http_get()` uses
440
+ `"User-Agent": "Mozilla/5.0"` by default — no override needed when calling it directly.
441
+ Only pass a custom `headers=` if you need to change something else.
442
+
443
+ - **`developer.api.walmart.com`** returns HTTP 403 without an API key. Not usable for
444
+ unauthenticated scraping.