@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. /package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. /package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,444 +1,444 @@
1
- # Walmart — Product Search & Data Extraction
2
-
3
- Field-tested against walmart.com on 2026-04-18 using `http_get` (no browser required).
4
- All code blocks were run and outputs verified against live responses.
5
-
6
- ---
7
-
8
- ## Fastest Approach: `http_get` with `__NEXT_DATA__`
9
-
10
- Walmart's Next.js SSR embeds the full search or product payload as JSON in a
11
- `<script id="__NEXT_DATA__">` tag. **No browser needed for search or product detail pages.**
12
- ~2–3 s per page fetch; no CAPTCHA or session cookies required.
13
-
14
- ### Critical UA rule
15
-
16
- | User-Agent | Result |
17
- |---|---|
18
- | `Mozilla/5.0` (bare) | Full HTML + `__NEXT_DATA__` — **use this** |
19
- | `Mozilla/5.0 ... Chrome/120 ...` (full) | PerimeterX "Robot or human?" challenge (200, 15 KB) |
20
- | `Safari/17` full UA | Works (full HTML, ~1.15 MB) |
21
- | `curl/7.x` | PerimeterX challenge |
22
- | `python-requests/2.31` | PerimeterX challenge |
23
-
24
- The bare `Mozilla/5.0` string bypasses PerimeterX. In the tests above, HTTP-client UAs
25
- (curl, python-requests) and the full Chrome UA triggered the challenge page; the full Safari UA did not.
26
-
27
- ### Base fetch helper
28
-
29
- ```python
30
- import json, re, gzip, urllib.request
31
-
32
- def fetch_walmart(url):
33
-     """
34
-     Fetch any walmart.com page.
35
-     Returns decoded HTML string.
36
-     Raises RuntimeError if PerimeterX bot challenge is returned.
37
-     """
38
-     req = urllib.request.Request(
39
-         url,
40
-         headers={"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"},
41
-     )
42
-     with urllib.request.urlopen(req, timeout=20) as r:
43
-         data = r.read()
44
-         if r.headers.get("Content-Encoding") == "gzip":
45
-             data = gzip.decompress(data)
46
-     html = data.decode()
47
-     if "Robot or human" in html:
48
-         raise RuntimeError(f"PerimeterX challenge triggered: {url}")
49
-     return html
50
-
51
- def parse_next_data(html):
52
-     m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
53
-     if not m:
54
-         raise ValueError("__NEXT_DATA__ not found — page structure may have changed")
55
-     return json.loads(m.group(1))
56
- ```
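
To make the regex step in `parse_next_data` concrete, here is a minimal offline sketch (not part of the diffed file) that runs the same pattern against a synthetic page. `SAMPLE_HTML` and its payload are invented for illustration; a real Walmart response is far larger and its shape may differ.

```python
import json
import re


def parse_next_data(html):
    """Extract and decode the JSON body of the __NEXT_DATA__ script tag."""
    m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if not m:
        raise ValueError("__NEXT_DATA__ not found")
    return json.loads(m.group(1))


# Synthetic page, invented for this example only.
SAMPLE_HTML = (
    "<html><body>"
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"initialData": {"searchResult": {"aggregatedCount": 3}}}}}'
    "</script></body></html>"
)

data = parse_next_data(SAMPLE_HTML)
count = data["props"]["pageProps"]["initialData"]["searchResult"]["aggregatedCount"]
print(count)
```

The non-greedy `(.*?)` stops at the first `</script>`, which is why the pattern only works when the JSON payload itself contains no literal `</script>` text.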
57
-
58
- ---
59
-
60
- ## Search Results
61
-
62
- ### URL patterns
63
-
64
- ```python
65
- # Keyword search
66
- "https://www.walmart.com/search?q=laptop"
67
-
68
- # Pagination — append &page=N
69
- "https://www.walmart.com/search?q=laptop&page=2"
70
-
71
- # Sort options (confirmed working)
72
- "https://www.walmart.com/search?q=laptop&sort=best_match" # default
73
- "https://www.walmart.com/search?q=laptop&sort=best_seller"
74
- "https://www.walmart.com/search?q=laptop&sort=price_low"
75
- "https://www.walmart.com/search?q=laptop&sort=customer_rating"
76
-
77
- # Price filter
78
- "https://www.walmart.com/search?q=laptop&min_price=200&max_price=500"
79
-
80
- # Browse by category (department ID path)
81
- "https://www.walmart.com/browse/electronics/laptops/3944_1089430_3951"
82
- ```
83
-
84
- ### `__NEXT_DATA__` path to items
85
-
86
- ```
87
- data
88
-   .props.pageProps.initialData.searchResult
89
-     .aggregatedCount — int: total matching products (e.g. 18818)
90
-     .paginationV2.maxPage — int: last page number
91
-     .itemStacks[] — array of stacks (usually 2: sponsored + organic)
92
-       .items[] — array of product objects
93
- ```
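
One way to walk a path like the one above without a chain of `KeyError` checks is a small lookup helper. The sketch below is an illustration only: `dig` is a hypothetical helper, not something the package provides, and `payload` is an invented stand-in that mirrors just the documented keys.

```python
def dig(obj, *path, default=None):
    """Follow keys and list indexes through nested data; return default on any miss."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Synthetic payload with invented values, shaped like the documented path.
payload = {
    "props": {"pageProps": {"initialData": {"searchResult": {
        "aggregatedCount": 2,
        "paginationV2": {"maxPage": 1},
        "itemStacks": [{"items": [{"usItemId": "111"}, {"usItemId": "222"}]}],
    }}}}
}

SEARCH_RESULT = ("props", "pageProps", "initialData", "searchResult")
total = dig(payload, *SEARCH_RESULT, "aggregatedCount")
first_id = dig(payload, *SEARCH_RESULT, "itemStacks", 0, "items", 0, "usItemId")
missing = dig(payload, *SEARCH_RESULT, "itemStacks", 5, "items", default=[])
```

Returning a default instead of raising keeps a scraper running when a stack or field is absent, at the cost of hiding structural changes; raising, as the extractor in the diff does for the top-level path, surfaces them sooner.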
94
-
95
- ### Full extractor (field-tested)
96
-
97
- ```python
98
- def extract_search_results(html):
99
-     """
100
-     Returns (items, total_count, max_page).
101
-     items is a list of dicts with confirmed fields.
102
-     """
103
-     data = parse_next_data(html)
104
-     sr = data["props"]["pageProps"]["initialData"]["searchResult"]
105
-
106
-     items = []
107
-     for stack in sr.get("itemStacks", []):
108
-         for item in stack.get("items", []):
109
-             pi = item.get("priceInfo") or {}
110
-             img = item.get("imageInfo") or {}
111
-             rating = item.get("rating") or {}
112
-             avail = item.get("availabilityStatusV2") or {}
113
-             items.append({
114
-                 "usItemId": item.get("usItemId"),  # str, Walmart item ID
115
-                 "name": item.get("name"),  # str
116
-                 "brand": item.get("brand"),  # str or None
117
-                 "price": item.get("price"),  # int, current price in USD
118
-                 "linePrice": pi.get("linePrice"),  # str "$429.00"
119
-                 "wasPrice": pi.get("wasPrice") or None,  # str "$699.00" or None
120
-                 "savings": pi.get("savings") or None,  # str "SAVE $270.00" or None
121
-                 "averageRating": rating.get("averageRating"),  # float e.g. 4.3
122
-                 "numberOfReviews": rating.get("numberOfReviews"),  # int
123
-                 "availability": avail.get("value"),  # "IN_STOCK" / "OUT_OF_STOCK"
124
-                 "isSponsored": bool(item.get("isSponsoredFlag")),
125
-                 "url": "https://www.walmart.com" + (item.get("canonicalUrl") or "").split("?")[0],
126
-                 "thumbnailUrl": img.get("thumbnailUrl"),
127
-             })
128
-
129
-     total = sr.get("aggregatedCount")
130
-     max_page = (sr.get("paginationV2") or {}).get("maxPage")
131
-     return items, total, max_page
132
-
133
-
134
- # Usage
135
- html = fetch_walmart("https://www.walmart.com/search?q=laptop")
136
- items, total, max_page = extract_search_results(html)
137
- # items: 66 items on page 1, total=18818, max_page=11
138
-
139
- # Filter out sponsored
140
- organic = [i for i in items if not i["isSponsored"]]
141
- ```
142
-
143
- ### Field notes (confirmed)
144
-
145
- - **`usItemId`**: string, matches the numeric ID at the end of `/ip/.../ITEMID` URLs.
146
- Some non-product rows (ad widgets) have `usItemId=None` — filter with `if item.get("usItemId")`.
147
- - **`price`**: integer cents-less price (e.g. `429` for "$429.00"). Use `priceInfo.linePrice` for
148
- the formatted string including the dollar sign.
149
- - **`wasPrice` / `savings`**: only present when item is on sale. Always `None` for full-price items.
150
- - **`isSponsoredFlag`**: the first batch of results across both itemStacks are frequently sponsored.
151
- On a laptop search, ~56 of 66 SSR items carry `isSponsoredFlag: true`.
152
- - **`rating`**: present on ~91% of items (60/66 in test). `averageRating` is a float; `numberOfReviews` is int.
153
- - **`canonicalUrl`**: always includes `?classType=...&athbdg=...` query params — strip with `.split("?")[0]`
154
- to get a clean URL.
155
- - **Two itemStacks**: Walmart returns two stacks (`itemStacks[0]` and `itemStacks[1]`). Merge them.
156
- `itemStacks[0]` is the primary grid; `itemStacks[1]` is a secondary sponsored/related block.
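
The field notes above suggest two post-processing steps: dropping rows without a `usItemId` and turning the formatted `linePrice` string into a number. A minimal sketch of both follows. The helper names and the sample rows are invented for illustration, and the comma handling for thousands separators is an assumption, since the notes only show prices below $1,000.

```python
from decimal import Decimal, InvalidOperation


def parse_price_string(text):
    """Turn a display string such as "$429.00" into a Decimal; None if unparseable."""
    if not text:
        return None
    try:
        # Assumption: thousands are comma-separated, e.g. "$1,299.99".
        return Decimal(text.replace("$", "").replace(",", "").strip())
    except InvalidOperation:
        return None


def split_items(items):
    """Drop rows without usItemId, then split the rest into (organic, sponsored)."""
    real = [i for i in items if i.get("usItemId")]
    organic = [i for i in real if not i.get("isSponsored")]
    sponsored = [i for i in real if i.get("isSponsored")]
    return organic, sponsored


# Invented rows in the shape produced by extract_search_results.
rows = [
    {"usItemId": "111", "linePrice": "$429.00", "isSponsored": True},
    {"usItemId": None, "linePrice": None, "isSponsored": False},  # ad widget row
    {"usItemId": "222", "linePrice": "$1,299.99", "isSponsored": False},
]
organic, sponsored = split_items(rows)
prices = [parse_price_string(r["linePrice"]) for r in organic + sponsored]
```

`Decimal` is used rather than `float` so that prices compare and sum exactly.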
157
-
158
- ### Pagination
159
-
160
- ```python
161
- for page in range(1, max_page + 1):
162
-     html = fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={page}")
163
-     items, _, _ = extract_search_results(html)
164
-     # process items...
165
- ```
166
-
167
- Page responses average ~2.5 s each. No rate-limiting was observed across 3 sequential requests.
168
- For bulk scraping, add a 1–2 s delay between requests to be safe.
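
The suggested delay can be folded into the pagination loop. The sketch below is one possible arrangement, not the package's implementation: `crawl_pages` is a hypothetical helper, the fetch and extract functions are passed in so the loop can be exercised offline, and the stubs and `example.test` URL are invented.

```python
import random
import time


def crawl_pages(fetch, extract, query_url, max_pages=None,
                delay_range=(1.0, 2.0), sleep=time.sleep):
    """Fetch page 1, read max_page from it, then walk the rest with a pause between requests."""
    all_items = []
    page, last_page = 1, 1
    while page <= last_page:
        html = fetch(f"{query_url}&page={page}")
        items, _total, max_page = extract(html)
        all_items.extend(items)
        if page == 1 and max_page:
            # Cap the crawl if the caller asked for fewer pages than exist.
            last_page = max_page if max_pages is None else min(max_page, max_pages)
        page += 1
        if page <= last_page:
            sleep(random.uniform(*delay_range))  # pause only between requests
    return all_items


# Offline demo with stubs: every "page" reports 3 pages in total.
calls, naps = [], []

def stub_fetch(url):
    calls.append(url)
    return url

def stub_extract(html):
    return [html], 3, 3

collected = crawl_pages(stub_fetch, stub_extract,
                        "https://example.test/search?q=laptop", sleep=naps.append)
```

Reading `max_page` from the first response avoids a separate probing request, and a missing `max_page` simply limits the crawl to page 1.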
169
-
170
- ---
171
-
172
- ## Product Detail Page
173
-
174
- ### URL pattern
175
-
176
- ```
177
- https://www.walmart.com/ip/{slug}/{usItemId}
178
- ```
179
-
180
- The slug is ignored in routing — only the numeric `usItemId` matters.
181
- These work identically:
182
- ```
183
- https://www.walmart.com/ip/anything/19717318352
184
- https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352
185
- ```
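
Since only the numeric ID matters according to the note above, URLs can be built from, or reduced to, a `usItemId`. The helpers below are a hypothetical sketch; their names and the placeholder slug `item` are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def product_url(us_item_id, slug="item"):
    """Build a product URL; the slug is a placeholder because routing reportedly ignores it."""
    return f"https://www.walmart.com/ip/{slug}/{us_item_id}"


def item_id_from_url(url):
    """Return the trailing numeric path segment of a product URL, or None."""
    path = urlsplit(url).path.rstrip("/")
    last = path.rsplit("/", 1)[-1]
    return last if last.isdigit() else None


built = product_url("19717318352", slug="anything")
parsed = item_id_from_url("https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352?classType=X")
```

`urlsplit` separates the query string from the path, so tracking parameters do not leak into the extracted ID.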
186
-
187
- ### `__NEXT_DATA__` path on a product page
188
-
189
- ```
190
- data.props.pageProps.initialData.data
191
- .product — core product object
192
- .idml — long description, specs, highlights, warranty
193
- .reviews — rating breakdown + first 10 customer reviews (SSR)
194
- ```
195
-
196
- ### Full extractor (field-tested)
197
-
198
- ```python
199
- def extract_product_detail(html):
200
-     """
201
-     Returns a dict with all confirmed product fields.
202
-     idml.specifications returns all spec rows as a flat dict.
203
-     reviews returns the SSR-rendered first 10 customer reviews.
204
-     """
205
-     data = parse_next_data(html)
206
-     d = data["props"]["pageProps"]["initialData"]["data"]
207
-     product = d["product"]
208
-     idml = d.get("idml") or {}
209
-     reviews = d.get("reviews") or {}
210
-
211
-     pi = product.get("priceInfo") or {}
212
-     cp = pi.get("currentPrice") or {}
213
-     img = product.get("imageInfo") or {}
214
-     avail = product.get("availabilityStatusV2") or {}
215
-
216
-     specs = {
217
-         spec.get("name"): spec.get("value")
218
-         for spec in (idml.get("specifications") or [])
219
-     }
220
-
221
-     all_images = [
222
-         img_item.get("url")
223
-         for img_item in (img.get("allImages") or [])
224
-         if img_item.get("url")
225
-     ]
226
-
227
-     customer_reviews = [
228
-         {
229
-             "title": r.get("reviewTitle"),
230
-             "rating": r.get("rating"),  # int 1-5 (field is "rating", NOT "overallRating")
231
-             "text": r.get("reviewText"),
232
-             "author": r.get("userNickname"),
233
-             "date": r.get("reviewSubmissionTime"),
234
-         }
235
-         for r in (reviews.get("customerReviews") or [])
236
-     ]
237
-
238
-     return {
239
-         # identity
240
-         "usItemId": product.get("usItemId"),
241
-         "name": product.get("name"),
242
-         "brand": product.get("brand"),
243
-         "model": product.get("model"),
244
-         "upc": product.get("upc"),
245
-         # price
246
-         "price": cp.get("price"),  # float, e.g. 599
247
-         "priceString": cp.get("priceString"),  # "$599.00"
248
-         "wasPrice": (pi.get("wasPrice") or {}).get("priceString"),
249
-         "savings": (pi.get("savings") or {}).get("savingsString"),
250
-         # availability
251
-         "availability": avail.get("value"),  # "IN_STOCK" / "OUT_OF_STOCK"
252
-         "availabilityDisplay": avail.get("display"),  # "In stock"
253
-         # ratings
254
-         "averageRating": product.get("averageRating"),
255
-         "numberOfReviews": product.get("numberOfReviews"),
256
-         # text
257
-         "shortDescription": product.get("shortDescription"),
258
-         "longDescription": idml.get("longDescription"),  # HTML string
259
-         # media
260
-         "thumbnailUrl": img.get("thumbnailUrl"),
261
-         "allImages": all_images,  # up to 10 image URLs
262
-         # specs
263
-         "specifications": specs,  # {"Brand": "Apple", "Processor": "A18 Pro", ...}
264
-         "highlights": [  # top highlighted specs with icons
265
-             {"name": h.get("name"), "value": h.get("value")}
266
-             for h in (idml.get("productHighlights") or [])
267
-         ],
268
-         # URL
269
-         "canonicalUrl": "https://www.walmart.com" + (product.get("canonicalUrl") or ""),
270
-         # fulfillment
271
-         "fulfillmentOptions": product.get("fulfillmentOptions") or [],
272
-         # reviews (SSR-rendered, first 10)
273
-         "reviewSummary": {
274
-             "averageOverallRating": reviews.get("averageOverallRating"),
275
-             "totalReviewCount": reviews.get("totalReviewCount"),
276
-             "reviewsWithTextCount": reviews.get("reviewsWithTextCount"),
277
-             "recommendedPercentage": reviews.get("recommendedPercentage"),
278
-         },
279
-         "customerReviews": customer_reviews,
280
-     }
281
-
282
-
283
- # Usage
284
- url = "https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352"
285
- html = fetch_walmart(url)
286
- product = extract_product_detail(html)
287
-
288
- # Example output (confirmed live):
289
- # product["name"] → "Apple MacBook Neo 13-inch Apple A18 Pro chip..."
290
- # product["price"] → 599
291
- # product["priceString"] → "$599.00"
292
- # product["availability"] → "IN_STOCK"
293
- # product["model"] → "MHFD4LL/A"
294
- # product["upc"] → "195950852745"
295
- # len(product["specifications"]) → 29 spec rows
296
- # len(product["allImages"]) → 10
297
- # product["specifications"]["Processor"] → "A18 Pro"
298
- ```
299
-
300
- ### Field notes (confirmed)
301
-
302
- - **`averageRating` / `numberOfReviews`** on the product node: present for items with reviews.
303
- New/few-review items may return `None` for both.
304
- - **`reviewSummary.averageOverallRating`** in the reviews node often differs slightly from
305
- `product.averageRating` — the reviews node is more precise (e.g. `4.75` vs `4.8`).
306
- - **`customerReviews`** (SSR): always the first 10 reviews. The per-review rating field is `"rating"`
307
- (int 1–5), **not** `"overallRating"` (which is always `None`).
308
- - **`longDescription`**: raw HTML string including `<ul>/<li>` tags. Strip tags before display.
309
- - **`specifications`**: flat dict — confirmed 29–31 rows for electronics. Key names use display labels
310
- (e.g. `"RAM memory"`, `"Screen size"`, `"HD capacity"`).
311
- - **`wasPrice` / `savings`** on detail page: same as search — `None` when item is not discounted.
312
- - **No JSON-LD**: Walmart product pages do **not** include `<script type="application/ld+json">`.
313
- All structured data lives in `__NEXT_DATA__`.
314
-
315
- ---
316
-
317
- ## Anti-Bot: PerimeterX
318
-
319
- Walmart uses **PerimeterX** (app ID `PXu6b0qd2S`, confirmed in `runtimeConfig.perimeterX`).
320
-
321
- | Signal | Detail |
322
- |---|---|
323
- | Bot detector | PerimeterX |
324
- | Challenge page | "Robot or human?" — 200 OK, 15 KB HTML |
325
- | Triggered by | Full browser UA strings (Chrome, curl, python-requests) |
326
- | Bypassed by | `User-Agent: Mozilla/5.0` (bare prefix only) |
327
- | No JS execution | SSR response is complete — no JS challenge to solve |
328
-
329
- Detection in code:
330
- ```python
331
- if "Robot or human" in html:
332
- raise RuntimeError("PerimeterX challenge — switch to browser harness")
333
- ```
334
-
335
- If `http_get` starts returning the challenge after a run of successful fetches, switch to the
336
- browser harness (see below).
337
-
338
- ---
339
-
340
- ## Browser Harness Fallback
341
-
342
- Use the browser harness when:
343
- - PerimeterX starts blocking `http_get` on your IP
344
- - You need to interact with the page (add to cart, filter UI, infinite scroll)
345
- - You need variant switching (color/size selectors)
346
-
347
- ```python
348
- # Browser-based search extraction
349
- new_tab("https://www.walmart.com/search?q=laptop")
350
- wait_for_load()
351
- wait(2) # JS renders product cards after readyState=complete
352
-
353
- # Extract via __NEXT_DATA__ in-browser (identical structure to http_get)
354
- import json
355
- nd = js("document.getElementById('__NEXT_DATA__')?.textContent")
356
- data = json.loads(nd)
357
- sr = data["props"]["pageProps"]["initialData"]["searchResult"]
358
- items = []
359
- for stack in sr.get("itemStacks", []):
360
- items.extend(stack.get("items", []))
361
- ```
362
-
363
- ### Browser selectors (confirmed working for DOM-based extraction)
364
-
365
- ```python
366
- # Product cards on search results page
367
- results = js("""
368
- Array.from(document.querySelectorAll('[data-item-id]')).map(el => ({
369
- itemId: el.getAttribute('data-item-id'),
370
- name: el.querySelector('[itemprop="name"]')?.innerText?.trim(),
371
- price: el.querySelector('[itemprop="price"]')?.getAttribute('content'),
372
- url: el.querySelector('a[link-identifier]')?.href,
373
- })).filter(r => r.itemId)
374
- """)
375
-
376
- # If [data-item-id] misses items, use the Next.js data attribute alternative:
377
- results_alt = js("""
378
- Array.from(document.querySelectorAll('[data-testid="list-view"]'))
379
- .map(el => el.innerText.trim())
380
- """)
381
- ```
382
-
383
- > **Prefer `__NEXT_DATA__` over DOM selectors** even in-browser — the JSON is complete and
384
- > stable. DOM class names at Walmart are obfuscated and change between deployments.
385
-
386
- ### Session gotcha
387
-
388
- Always open Walmart with `new_tab()` on first visit:
389
- ```python
390
- new_tab("https://www.walmart.com/search?q=laptop")
391
- wait_for_load()
392
- wait(2)
393
- ```
394
- After that, `goto_url()` works normally within the same session.
395
-
396
- ---
397
-
398
- ## Public API
399
-
400
- Walmart's affiliate/partner API (`developer.api.walmart.com`) requires a registered API key
401
- and returns HTTP 403 without one. No unauthenticated public product API is available.
402
- The `__NEXT_DATA__` SSR approach replaces any need for the official API for read-only data.
403
-
404
- ---
405
-
406
- ## Gotchas
407
-
408
- - **UA must be `Mozilla/5.0` bare**: Any fuller string (Chrome, Safari, curl, requests) hits
409
- PerimeterX. This is counterintuitive — the *shorter*, less realistic UA is the one that works.
410
-
411
- - **Regex must use `id=` attribute match**: The regex
412
- `r'<script id="__NEXT_DATA__" type="application/json">...'` fails because the actual tag is
413
- `<script id="__NEXT_DATA__">` without `type`. Use:
414
- ```python
415
- re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
416
- ```
417
-
418
- - **`usItemId` can be `None`**: ~5/66 items on a page are non-product ad widgets with no `usItemId`.
419
- Always filter: `[i for i in items if i.get("usItemId")]`.
420
-
421
- - **Two `itemStacks`**: Walmart returns two stacks. Iterate over all stacks or you'll miss
422
- ~10 items from the second stack.
423
-
424
- - **`canonicalUrl` includes tracking params**: Always strip with `.split("?")[0]`.
425
-
426
- - **Review field is `"rating"` not `"overallRating"`**: Each `customerReviews` entry has a `"rating"`
427
- int field (1–5). The `"overallRating"` field is always `None`. Don't confuse with
428
- `product.averageRating` (the aggregate float).
429
-
430
- - **No JSON-LD on product pages**: Zero `<script type="application/ld+json">` tags were found.
431
- All structured data is in `__NEXT_DATA__`.
432
-
433
- - **`longDescription` is HTML**: Strip tags before text use. May contain promotional/financing copy
434
- mixed with real product description.
435
-
436
- - **Page sizes vary**: Page 1 returned 66 items across 2 stacks; page 2 returned 55.
437
- Do not assume a fixed items-per-page count.
438
-
439
- - **`http_get` default already sends `Mozilla/5.0`**: `helpers.http_get()` uses
440
- `"User-Agent": "Mozilla/5.0"` by default — no override needed when calling it directly.
441
- Only pass a custom `headers=` if you need to change something else.
442
-
443
- - **`developer.api.walmart.com`** returns HTTP 403 without an API key. Not usable for
444
- unauthenticated scraping.
1
+ # Walmart — Product Search & Data Extraction
2
+
3
+ Field-tested against walmart.com on 2026-04-18 using `http_get` (no browser required).
4
+ All code blocks were run and outputs verified against live responses.
5
+
6
+ ---
7
+
8
+ ## Fastest Approach: `http_get` with `__NEXT_DATA__`
9
+
10
+ Walmart's Next.js SSR embeds the full search or product payload as JSON in a
11
+ `<script id="__NEXT_DATA__">` tag. **No browser needed for search or product detail pages.**
12
+ ~2–3 s per page fetch; no CAPTCHA or session cookies required.
13
+
14
+ ### Critical UA rule
15
+
16
+ | User-Agent | Result |
17
+ |---|---|
18
+ | `Mozilla/5.0` (bare) | Full HTML + `__NEXT_DATA__` — **use this** |
19
+ | `Mozilla/5.0 ... Chrome/120 ...` (full) | PerimeterX "Robot or human?" challenge (200, 15 KB) |
20
+ | `Safari/17` full UA | Works (full HTML, ~1.15 MB) |
21
+ | `curl/7.x` | PerimeterX challenge |
22
+ | `python-requests/2.31` | PerimeterX challenge |
23
+
24
+ The bare `Mozilla/5.0` string bypasses PerimeterX. In testing, a full Chrome UA and the default
25
+ `curl` and `python-requests` UAs triggered the challenge page, while a full Safari UA did not.
26
+
27
+ ### Base fetch helper
28
+
29
+ ```python
30
+ import json, re, gzip, urllib.request
31
+
32
+ def fetch_walmart(url):
33
+     """
34
+     Fetch any walmart.com page.
35
+     Returns decoded HTML string.
36
+     Raises RuntimeError if PerimeterX bot challenge is returned.
37
+     """
38
+     req = urllib.request.Request(
39
+         url,
40
+         headers={"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip"},
41
+     )
42
+     with urllib.request.urlopen(req, timeout=20) as r:
43
+         data = r.read()
44
+         if r.headers.get("Content-Encoding") == "gzip":
45
+             data = gzip.decompress(data)
46
+     html = data.decode()
47
+     if "Robot or human" in html:
48
+         raise RuntimeError(f"PerimeterX challenge triggered: {url}")
49
+     return html
50
+
51
+ def parse_next_data(html):
52
+     m = re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
53
+     if not m:
54
+         raise ValueError("__NEXT_DATA__ not found — page structure may have changed")
55
+     return json.loads(m.group(1))
56
+ ```
57
+
58
+ ---
59
+
60
+ ## Search Results
61
+
62
+ ### URL patterns
63
+
64
+ ```python
65
+ # Keyword search
66
+ "https://www.walmart.com/search?q=laptop"
67
+
68
+ # Pagination — append &page=N
69
+ "https://www.walmart.com/search?q=laptop&page=2"
70
+
71
+ # Sort options (confirmed working)
72
+ "https://www.walmart.com/search?q=laptop&sort=best_match" # default
73
+ "https://www.walmart.com/search?q=laptop&sort=best_seller"
74
+ "https://www.walmart.com/search?q=laptop&sort=price_low"
75
+ "https://www.walmart.com/search?q=laptop&sort=customer_rating"
76
+
77
+ # Price filter
78
+ "https://www.walmart.com/search?q=laptop&min_price=200&max_price=500"
79
+
80
+ # Browse by category (department ID path)
81
+ "https://www.walmart.com/browse/electronics/laptops/3944_1089430_3951"
82
+ ```
83
+
84
+ ### `__NEXT_DATA__` path to items
85
+
86
+ ```
87
+ data
88
+   .props.pageProps.initialData.searchResult
89
+     .aggregatedCount — int: total matching products (e.g. 18818)
90
+     .paginationV2.maxPage — int: last page number
91
+     .itemStacks[] — array of stacks (usually 2: sponsored + organic)
92
+       .items[] — array of product objects
93
+ ```
94
+
95
+ ### Full extractor (field-tested)
96
+
97
+ ```python
98
+ def extract_search_results(html):
99
+     """
100
+     Returns (items, total_count, max_page).
101
+     items is a list of dicts with confirmed fields.
102
+     """
103
+     data = parse_next_data(html)
104
+     sr = data["props"]["pageProps"]["initialData"]["searchResult"]
105
+
106
+     items = []
107
+     for stack in sr.get("itemStacks", []):
108
+         for item in stack.get("items", []):
109
+             pi = item.get("priceInfo") or {}
110
+             img = item.get("imageInfo") or {}
111
+             rating = item.get("rating") or {}
112
+             avail = item.get("availabilityStatusV2") or {}
113
+             items.append({
114
+                 "usItemId": item.get("usItemId"), # str, Walmart item ID
115
+                 "name": item.get("name"), # str
116
+                 "brand": item.get("brand"), # str or None
117
+                 "price": item.get("price"), # int, current price in USD
118
+                 "linePrice": pi.get("linePrice"), # str "$429.00"
119
+                 "wasPrice": pi.get("wasPrice") or None, # str "$699.00" or None
120
+                 "savings": pi.get("savings") or None, # str "SAVE $270.00" or None
121
+                 "averageRating": rating.get("averageRating"), # float e.g. 4.3
122
+                 "numberOfReviews": rating.get("numberOfReviews"), # int
123
+                 "availability": avail.get("value"), # "IN_STOCK" / "OUT_OF_STOCK"
124
+                 "isSponsored": bool(item.get("isSponsoredFlag")),
125
+                 "url": "https://www.walmart.com" + (item.get("canonicalUrl") or "").split("?")[0],
126
+                 "thumbnailUrl": img.get("thumbnailUrl"),
127
+             })
128
+
129
+     total = sr.get("aggregatedCount")
130
+     max_page = (sr.get("paginationV2") or {}).get("maxPage")
131
+     return items, total, max_page
132
+
133
+
134
+ # Usage
135
+ html = fetch_walmart("https://www.walmart.com/search?q=laptop")
136
+ items, total, max_page = extract_search_results(html)
137
+ # items: 66 items on page 1, total=18818, max_page=11
138
+
139
+ # Filter out sponsored
140
+ organic = [i for i in items if not i["isSponsored"]]
141
+ ```
142
+
143
+ ### Field notes (confirmed)
144
+
145
+ - **`usItemId`**: string, matches the numeric ID at the end of `/ip/.../ITEMID` URLs.
146
+ Some non-product rows (ad widgets) have `usItemId=None` — filter with `if item.get("usItemId")`.
147
+ - **`price`**: integer cents-less price (e.g. `429` for "$429.00"). Use `priceInfo.linePrice` for
148
+ the formatted string including the dollar sign.
149
+ - **`wasPrice` / `savings`**: only present when item is on sale. Always `None` for full-price items.
150
+ - **`isSponsoredFlag`**: the first batch of results across both itemStacks are frequently sponsored.
151
+ On a laptop search, ~56 of 66 SSR items carry `isSponsoredFlag: true`.
152
+ - **`rating`**: present on ~91% of items (60/66 in test). `averageRating` is a float; `numberOfReviews` is int.
153
+ - **`canonicalUrl`**: always includes `?classType=...&athbdg=...` query params — strip with `.split("?")[0]`
154
+ to get a clean URL.
155
+ - **Two itemStacks**: Walmart returns two stacks (`itemStacks[0]` and `itemStacks[1]`). Merge them.
156
+ `itemStacks[0]` is the primary grid; `itemStacks[1]` is a secondary sponsored/related block.
157
+
158
+ ### Pagination
159
+
160
+ ```python
161
+ for page in range(1, (max_page or 1) + 1):  # maxPage can be missing; fall back to one page
162
+     html = fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={page}")
163
+     items, _, _ = extract_search_results(html)
164
+     # process items...
165
+ ```
166
+
167
+ Page responses average ~2.5 s each. No rate-limiting was observed across 3 sequential requests.
168
+ For bulk scraping, add a 1–2 s delay between requests to be safe.
169
+
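The delay advice above can be folded into a small generator. This is a minimal sketch, not part of the original helpers: `iter_pages` and its `fetch_page` callable are hypothetical names, and `fetch_page(n)` is assumed to return the item list for page `n` (for example, a wrapper around `fetch_walmart` and `extract_search_results`).

```python
import time
from typing import Callable, Iterator, List, Optional


def iter_pages(
    fetch_page: Callable[[int], List[dict]],
    max_page: Optional[int],
    delay_s: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield product items page by page, pausing between requests.

    fetch_page is a hypothetical callable (an assumption of this sketch):
    it takes a 1-based page number and returns that page's item dicts.
    """
    last = max_page or 1  # treat a missing maxPage as a single page
    for page in range(1, last + 1):
        if page > 1:
            sleep(delay_s)  # pause between requests, in the 1-2 s range suggested above
        for item in fetch_page(page):
            if item.get("usItemId"):  # skip ad widgets that carry no item ID
                yield item
```

With the helpers from this document, `fetch_page` could be `lambda n: extract_search_results(fetch_walmart(f"https://www.walmart.com/search?q=laptop&page={n}"))[0]`. Passing `sleep` as a parameter keeps the generator testable without real waiting.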
170
+ ---
171
+
172
+ ## Product Detail Page
173
+
174
+ ### URL pattern
175
+
176
+ ```
177
+ https://www.walmart.com/ip/{slug}/{usItemId}
178
+ ```
179
+
180
+ The slug is ignored in routing — only the numeric `usItemId` matters.
181
+ These work identically:
182
+ ```
183
+ https://www.walmart.com/ip/anything/19717318352
184
+ https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352
185
+ ```
186
+
187
+ ### `__NEXT_DATA__` path on a product page
188
+
189
+ ```
190
+ data.props.pageProps.initialData.data
191
+   .product — core product object
192
+   .idml — long description, specs, highlights, warranty
193
+   .reviews — rating breakdown + first 10 customer reviews (SSR)
194
+ ```
195
+
196
+ ### Full extractor (field-tested)
197
+
198
+ ```python
199
+ def extract_product_detail(html):
200
+     """
201
+     Returns a dict with all confirmed product fields.
202
+     idml.specifications returns all spec rows as a flat dict.
203
+     reviews returns the SSR-rendered first 10 customer reviews.
204
+     """
205
+     data = parse_next_data(html)
206
+     d = data["props"]["pageProps"]["initialData"]["data"]
207
+     product = d["product"]
208
+     idml = d.get("idml") or {}
209
+     reviews = d.get("reviews") or {}
210
+
211
+     pi = product.get("priceInfo") or {}
212
+     cp = pi.get("currentPrice") or {}
213
+     img = product.get("imageInfo") or {}
214
+     avail = product.get("availabilityStatusV2") or {}
215
+
216
+     specs = {
217
+         spec.get("name"): spec.get("value")
218
+         for spec in (idml.get("specifications") or [])
219
+     }
220
+
221
+     all_images = [
222
+         img_item.get("url")
223
+         for img_item in (img.get("allImages") or [])
224
+         if img_item.get("url")
225
+     ]
226
+
227
+     customer_reviews = [
228
+         {
229
+             "title": r.get("reviewTitle"),
230
+             "rating": r.get("rating"), # int 1-5 (field is "rating", NOT "overallRating")
231
+             "text": r.get("reviewText"),
232
+             "author": r.get("userNickname"),
233
+             "date": r.get("reviewSubmissionTime"),
234
+         }
235
+         for r in (reviews.get("customerReviews") or [])
236
+     ]
237
+
238
+     return {
239
+         # identity
240
+         "usItemId": product.get("usItemId"),
241
+         "name": product.get("name"),
242
+         "brand": product.get("brand"),
243
+         "model": product.get("model"),
244
+         "upc": product.get("upc"),
245
+         # price
246
+         "price": cp.get("price"), # float, e.g. 599
247
+         "priceString": cp.get("priceString"), # "$599.00"
248
+         "wasPrice": (pi.get("wasPrice") or {}).get("priceString"),
249
+         "savings": (pi.get("savings") or {}).get("savingsString"),
250
+         # availability
251
+         "availability": avail.get("value"), # "IN_STOCK" / "OUT_OF_STOCK"
252
+         "availabilityDisplay": avail.get("display"), # "In stock"
253
+         # ratings
254
+         "averageRating": product.get("averageRating"),
255
+         "numberOfReviews": product.get("numberOfReviews"),
256
+         # text
257
+         "shortDescription": product.get("shortDescription"),
258
+         "longDescription": idml.get("longDescription"), # HTML string
259
+         # media
260
+         "thumbnailUrl": img.get("thumbnailUrl"),
261
+         "allImages": all_images, # up to 10 image URLs
262
+         # specs
263
+         "specifications": specs, # {"Brand": "Apple", "Processor": "A18 Pro", ...}
264
+         "highlights": [ # top highlighted specs with icons
265
+             {"name": h.get("name"), "value": h.get("value")}
266
+             for h in (idml.get("productHighlights") or [])
267
+         ],
268
+         # URL (query string stripped, matching the search extractor)
269
+         "canonicalUrl": "https://www.walmart.com" + (product.get("canonicalUrl") or "").split("?")[0],
270
+         # fulfillment
271
+         "fulfillmentOptions": product.get("fulfillmentOptions") or [],
272
+         # reviews (SSR-rendered, first 10)
273
+         "reviewSummary": {
274
+             "averageOverallRating": reviews.get("averageOverallRating"),
275
+             "totalReviewCount": reviews.get("totalReviewCount"),
276
+             "reviewsWithTextCount": reviews.get("reviewsWithTextCount"),
277
+             "recommendedPercentage": reviews.get("recommendedPercentage"),
278
+         },
279
+         "customerReviews": customer_reviews,
280
+     }
281
+
282
+
283
+ # Usage
284
+ url = "https://www.walmart.com/ip/Apple-MacBook-Neo/19717318352"
285
+ html = fetch_walmart(url)
286
+ product = extract_product_detail(html)
287
+
288
+ # Example output (confirmed live):
289
+ # product["name"] → "Apple MacBook Neo 13-inch Apple A18 Pro chip..."
290
+ # product["price"] → 599
291
+ # product["priceString"] → "$599.00"
292
+ # product["availability"] → "IN_STOCK"
293
+ # product["model"] → "MHFD4LL/A"
294
+ # product["upc"] → "195950852745"
295
+ # len(product["specifications"]) → 29 spec rows
296
+ # len(product["allImages"]) → 10
297
+ # product["specifications"]["Processor"] → "A18 Pro"
298
+ ```
299
+
300
+ ### Field notes (confirmed)
301
+
302
+ - **`averageRating` / `numberOfReviews`** on the product node: present for items with reviews.
303
+ New/few-review items may return `None` for both.
304
+ - **`reviewSummary.averageOverallRating`** in the reviews node often differs slightly from
305
+ `product.averageRating` — the reviews node is more precise (e.g. `4.75` vs `4.8`).
306
+ - **`customerReviews`** (SSR): always the first 10 reviews. The per-review rating field is `"rating"`
307
+ (int 1–5), **not** `"overallRating"` (which is always `None`).
308
+ - **`longDescription`**: raw HTML string including `<ul>/<li>` tags. Strip tags before display.
309
+ - **`specifications`**: flat dict — confirmed 29–31 rows for electronics. Key names use display labels
310
+ (e.g. `"RAM memory"`, `"Screen size"`, `"HD capacity"`).
311
+ - **`wasPrice` / `savings`** on detail page: same as search — `None` when item is not discounted.
312
+ - **No JSON-LD**: Walmart product pages do **not** include `<script type="application/ld+json">`.
313
+ All structured data lives in `__NEXT_DATA__`.
314
+
315
+ ---
316
+
317
+ ## Anti-Bot: PerimeterX
318
+
319
+ Walmart uses **PerimeterX** (app ID `PXu6b0qd2S`, confirmed in `runtimeConfig.perimeterX`).
320
+
321
+ | Signal | Detail |
322
+ |---|---|
323
+ | Bot detector | PerimeterX |
324
+ | Challenge page | "Robot or human?" — 200 OK, 15 KB HTML |
325
+ | Triggered by | Full Chrome UA string; HTTP-client UAs (curl, python-requests) |
326
+ | Bypassed by | `User-Agent: Mozilla/5.0` (bare prefix only) |
327
+ | No JS execution | SSR response is complete — no JS challenge to solve |
328
+
329
+ Detection in code:
330
+ ```python
331
+ if "Robot or human" in html:
332
+     raise RuntimeError("PerimeterX challenge — switch to browser harness")
333
+ ```
334
+
335
+ If `http_get` starts returning the challenge after a run of successful fetches, switch to the
336
+ browser harness (see below).
337
+
338
+ ---
339
+
340
+ ## Browser Harness Fallback
341
+
342
+ Use the browser harness when:
343
+ - PerimeterX starts blocking `http_get` on your IP
344
+ - You need to interact with the page (add to cart, filter UI, infinite scroll)
345
+ - You need variant switching (color/size selectors)
346
+
347
+ ```python
348
+ # Browser-based search extraction
349
+ new_tab("https://www.walmart.com/search?q=laptop")
350
+ wait_for_load()
351
+ wait(2) # JS renders product cards after readyState=complete
352
+
353
+ # Extract via __NEXT_DATA__ in-browser (identical structure to http_get)
354
+ import json
355
+ nd = js("document.getElementById('__NEXT_DATA__')?.textContent")
356
+ data = json.loads(nd)
357
+ sr = data["props"]["pageProps"]["initialData"]["searchResult"]
358
+ items = []
359
+ for stack in sr.get("itemStacks", []):
360
+     items.extend(stack.get("items", []))
361
+ ```
362
+
363
+ ### Browser selectors (confirmed working for DOM-based extraction)
364
+
365
+ ```python
366
+ # Product cards on search results page
367
+ results = js("""
368
+ Array.from(document.querySelectorAll('[data-item-id]')).map(el => ({
369
+ itemId: el.getAttribute('data-item-id'),
370
+ name: el.querySelector('[itemprop="name"]')?.innerText?.trim(),
371
+ price: el.querySelector('[itemprop="price"]')?.getAttribute('content'),
372
+ url: el.querySelector('a[link-identifier]')?.href,
373
+ })).filter(r => r.itemId)
374
+ """)
375
+
376
+ # If [data-item-id] misses items, fall back to the data-testid container (returns raw text only):
377
+ results_alt = js("""
378
+ Array.from(document.querySelectorAll('[data-testid="list-view"]'))
379
+ .map(el => el.innerText.trim())
380
+ """)
381
+ ```
382
+
383
+ > **Prefer `__NEXT_DATA__` over DOM selectors** even in-browser — the JSON is complete and
384
+ > stable. DOM class names at Walmart are obfuscated and change between deployments.
385
+
386
+ ### Session gotcha
387
+
388
+ Always open Walmart with `new_tab()` on first visit:
389
+ ```python
390
+ new_tab("https://www.walmart.com/search?q=laptop")
391
+ wait_for_load()
392
+ wait(2)
393
+ ```
394
+ After that, `goto_url()` works normally within the same session.
395
+
396
+ ---
397
+
398
+ ## Public API
399
+
400
+ Walmart's affiliate/partner API (`developer.api.walmart.com`) requires a registered API key
401
+ and returns HTTP 403 without one. No unauthenticated public product API is available.
402
+ The `__NEXT_DATA__` SSR approach replaces any need for the official API for read-only data.
403
+
404
+ ---
405
+
406
+ ## Gotchas
407
+
408
+ - **UA must be `Mozilla/5.0` bare**: A full Chrome UA and the curl/requests defaults hit PerimeterX
409
+   (a full Safari UA passed in testing). Counterintuitively, the *shorter*, less realistic UA is the reliable one.
410
+
411
+ - **Regex must use `id=` attribute match**: The regex
412
+ `r'<script id="__NEXT_DATA__" type="application/json">...'` fails because the actual tag is
413
+ `<script id="__NEXT_DATA__">` without `type`. Use:
414
+ ```python
415
+ re.search(r'id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
416
+ ```
417
+
418
+ - **`usItemId` can be `None`**: ~5/66 items on a page are non-product ad widgets with no `usItemId`.
419
+ Always filter: `[i for i in items if i.get("usItemId")]`.
420
+
421
+ - **Two `itemStacks`**: Walmart returns two stacks. Iterate over all stacks or you'll miss
422
+ ~10 items from the second stack.
423
+
424
+ - **`canonicalUrl` includes tracking params**: Always strip with `.split("?")[0]`.
425
+
426
+ - **Review field is `"rating"` not `"overallRating"`**: Each `customerReviews` entry has a `"rating"`
427
+ int field (1–5). The `"overallRating"` field is always `None`. Don't confuse with
428
+ `product.averageRating` (the aggregate float).
429
+
430
+ - **No JSON-LD on product pages**: Zero `<script type="application/ld+json">` tags were found.
431
+ All structured data is in `__NEXT_DATA__`.
432
+
433
+ - **`longDescription` is HTML**: Strip tags before text use. May contain promotional/financing copy
434
+ mixed with real product description.
435
+
436
+ - **Page sizes vary**: Page 1 returned 66 items across 2 stacks; page 2 returned 55.
437
+ Do not assume a fixed items-per-page count.
438
+
439
+ - **`http_get` default already sends `Mozilla/5.0`**: `helpers.http_get()` uses
440
+ `"User-Agent": "Mozilla/5.0"` by default — no override needed when calling it directly.
441
+ Only pass a custom `headers=` if you need to change something else.
442
+
443
+ - **`developer.api.walmart.com`** returns HTTP 403 without an API key. Not usable for
444
+ unauthenticated scraping.
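
The `longDescription` gotcha above says to strip tags before using the text. One way to do that with only the standard library is sketched below; `strip_html` and `_TextExtractor` are illustrative names, not helpers from this package, and the set of tags treated as line breaks is an assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, starting a new line at block-level tags."""

    # Assumed set of tags that should start a new line in the plain text.
    BLOCK_TAGS = {"p", "br", "li", "ul", "ol", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and similar entities
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the visible text of an HTML fragment, one line per block element."""
    if not fragment:
        return ""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

For example, `strip_html(product["longDescription"])` would turn a `<ul>/<li>` description into one plain-text line per bullet. It does not separate promotional copy from the real description; that still needs manual filtering.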