@pencil-agent/nano-pencil 2.0.1 → 2.0.3
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/model/custom-providers.js +1 -1
- package/dist/core/model-registry.js +5 -5
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/debug/index.js +9 -9
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/goal/index.js +6 -6
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +19 -19
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -190
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md
CHANGED
@@ -1,198 +1,198 @@
-# Amazon — Product Search & Data Extraction
-
-Field-tested against amazon.com on 2025-04-18 using a logged-in Chrome session.
-No CAPTCHA or bot detection was triggered during any test run.
-
-## Navigation
-
-### Direct search URL (fastest, always use this)
-```python
-goto_url("https://www.amazon.com/s?k=mechanical+keyboard")
-wait_for_load()
-wait(2)  # dynamic content needs ~2s after readyState=complete
-```
-
-### Search box typing (use when you need category filtering)
-```python
-goto_url("https://www.amazon.com")
-wait_for_load()
-wait(1)
-js("document.querySelector('#twotabsearchtextbox').focus()")
-js("document.querySelector('#twotabsearchtextbox').click()")
-wait(0.3)
-type_text("wireless mouse")
-wait(0.3)
-press_key("Enter")
-wait_for_load()
-wait(2)
-```
-
-### Direct product page
-```python
-# URL pattern: /dp/{ASIN} or /dp/{ASIN}?th=1 (Amazon may redirect to add ?th=1)
-goto_url("https://www.amazon.com/dp/B08Z6X4NK3")
-wait_for_load()
-wait(2)
-```
-
-## Session Gotcha
-
-**Always use `new_tab()` when opening Amazon for the first time in a harness session.**
-`goto_url()` can silently fail to navigate if the current tab resists the navigation
-(observed when the daemon attached to a different real tab). The safe pattern:
-
-```python
-tid = new_tab("https://www.amazon.com/s?k=mechanical+keyboard")
-wait_for_load()
-wait(2)
-```
-
-After that, `goto_url()` works fine within the same Amazon session.
-
-## Search Results Extraction
-
-### Container selector
-`[data-component-type="s-search-result"]` — confirmed working, yields ~22 results per page.
-
-### Full extraction (field-tested)
-```python
-results = js("""
-Array.from(document.querySelectorAll('[data-component-type="s-search-result"]')).map(el => ({
-  asin: el.getAttribute('data-asin'),
-  title: el.querySelector('h2 span')?.innerText?.trim(),
-  price: el.querySelector('.a-price .a-offscreen')?.innerText,
-  list_price: el.querySelector('.a-text-price .a-offscreen')?.innerText,
-  rating: el.querySelector('[aria-label*="out of 5 stars"]')?.getAttribute('aria-label')?.split(' ')[0],
-  reviews: el.querySelector('[aria-label*="ratings"]')?.getAttribute('aria-label'),
-  is_sponsored: !!el.querySelector('.puis-sponsored-label-text'),
-  url: el.querySelector('h2 a')?.href
-}))
-""")
-```
-
-### Field notes
-- **`asin`**: `data-asin` attribute on the container div — always present, matches the `/dp/{ASIN}` URL.
-- **`title`**: `h2 span` works consistently. `h2 a.a-link-normal span` also works.
-- **`price`**: `.a-price .a-offscreen` returns the formatted string e.g. `"$69.99"`. Use this, not `.a-price-whole`.
-- **`list_price`**: `.a-text-price .a-offscreen` — only present when item is on sale (was/now pricing).
-- **`rating`**: Use `aria-label` on `[aria-label*="out of 5 stars"]` — gives `"4.5 out of 5 stars, rating details"`, split on space for the number.
-- **`reviews`**: Use `[aria-label*="ratings"]` attribute — gives `"1,514 ratings"`. Do NOT use `.a-size-base.s-underline-text` — that element exists on sponsored results and shows "Xbox" (a cross-sell widget text).
-- **`is_sponsored`**: `.puis-sponsored-label-text` is present on sponsored listings; first 2-3 results are usually sponsored.
-- **`url`**: `h2 a` href — contains the full `/dp/{ASIN}/...` URL.
-
-## Product Detail Page Extraction
-
-### Confirmed selectors (field-tested on B08Z6X4NK3)
-```python
-detail = js("""
-({
-  title: document.querySelector('#productTitle')?.innerText?.trim(),
-  price: (function() {
-    var whole = document.querySelector('.a-price-whole')?.innerText?.replace(/[\\n.]/g,'');
-    var frac = document.querySelector('.a-price-fraction')?.innerText;
-    return (whole && frac) ? '$' + whole + '.' + frac
-      : document.querySelector('.a-price .a-offscreen')?.innerText || null;
-  })(),
-  list_price: document.querySelector('.basisPrice .a-offscreen')?.innerText,
-  rating: document.querySelector('#acrPopover')?.getAttribute('title'),
-  review_count: document.querySelector('#acrCustomerReviewText')?.innerText,
-  availability: document.querySelector('#availability span')?.innerText?.trim(),
-  brand: document.querySelector('#bylineInfo')?.innerText?.trim(),
-  asin: document.querySelector('input[name="ASIN"]')?.value,
-  bullet_points: Array.from(document.querySelectorAll('#feature-bullets li span.a-list-item'))
-    .map(e => e.innerText?.trim()).filter(t => t)
-})
-""")
-```
-
-### Price field notes
-- `#priceblock_ourprice` and `#priceblock_dealprice` are **legacy** — they return `null` on modern product pages.
-- Construct price from `.a-price-whole` + `.a-price-fraction` (both stripped of `\n` and `.`).
-- As a fallback: first `.a-price .a-offscreen` on the page also works (confirmed `$69.99`).
-- `list_price` from `.basisPrice .a-offscreen` shows the crossed-out "was" price when a discount exists.
-
-## Best Sellers Page
-
-URL: `https://www.amazon.com/Best-Sellers-{Category}/zgbs/{slug}/`
-e.g. `https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/`
-
-### DOM structure (2025)
-`.zg-item-immersion` **does not exist** — Amazon migrated to CSS modules. Use `[data-asin]` anchored on `[id="gridItemRoot"]`:
-
-```python
-goto_url("https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/")
-wait_for_load()
-wait(2)
-
-items = js("""
-Array.from(document.querySelectorAll('[data-asin]')).map(el => {
-  var container = el.closest('[id="gridItemRoot"]') || el;
-  return {
-    asin: el.getAttribute('data-asin'),
-    rank: container.querySelector('[class*="zg-bdg-text"]')?.innerText,
-    title: container.querySelector('img[alt]')?.getAttribute('alt'),
-    price: container.querySelector('.p13n-sc-price, .a-size-base.a-color-price')?.innerText,
-    url: 'https://www.amazon.com/dp/' + el.getAttribute('data-asin')
-  }
-}).filter(r => r.rank)
-""")
-```
-
-Note: Title comes from the product image `alt` attribute — the text title elements use obfuscated CSS module class names that change between deployments.
-
-## Pagination
-
-```python
-# Get next page URL directly
-next_url = js("document.querySelector('.s-pagination-next')?.href")
-if next_url:
-    goto_url(next_url)
-    wait_for_load()
-    wait(2)
-
-# Or construct by page number
-goto_url("https://www.amazon.com/s?k=wireless+mouse&page=2")
-```
-
-## Result Count
-
-```python
-count_text = js("document.querySelector('[data-component-type=\"s-result-info-bar\"] h1')?.innerText?.trim()")
-# Returns e.g.: '1-16 of over 40,000 results for "wireless mouse"\nSort by:\n...'
-# Extract just the count: count_text.split('\n')[0]
-```
-
-## CAPTCHA Detection
-
-No CAPTCHA was encountered during testing with a logged-in Chrome session. To detect defensively:
-
-```python
-def check_captcha():
-    text = js("document.body.innerText.slice(0,500)") or ""
-    url = page_info()["url"]
-    return (
-        "captcha" in text.lower()
-        or "enter the characters" in text.lower()
-        or "sorry, we just need to make sure" in text.lower()
-        or "captcha" in url.lower()
-        or "validateCaptcha" in url
-    )
-
-if check_captcha():
-    raise RuntimeError("Amazon CAPTCHA hit — stop and notify user")
-```
-
-Amazon may serve a CAPTCHA on fresh/anonymous sessions. Using the browser's existing logged-in session avoids this in practice.
-
-## Gotchas
-
-- **`goto_url()` silent failure**: On first visit, use `new_tab(url)` instead. After the tab is on Amazon, `goto_url()` works.
-- **`.zg-item-immersion` is gone**: Best Sellers page uses CSS module classes (obfuscated). Use `[data-asin]` + `img[alt]` for title.
-- **`.a-size-base.s-underline-text` is unreliable for review count**: On sponsored results it shows unrelated text (e.g. "Xbox"). Use `[aria-label*="ratings"]` instead.
-- **`#priceblock_ourprice` is legacy**: Returns `null` on modern pages. Construct from `.a-price-whole` + `.a-price-fraction`.
-- **Sponsored results appear first**: First 2-3 results are almost always `is_sponsored: true`. Filter them out with `!el.querySelector('.puis-sponsored-label-text')` when you need organic results.
-- **`data-asin` can be empty string on non-product rows**: Filter with `.filter(r => r.asin)`.
-- **Price split DOM**: `.a-price-whole` innerText includes a trailing `\n.` — strip it: `.replace(/[\n.]/g,'')`.
-- **ASIN from URL**: Use `/dp/([A-Z0-9]{10})/` regex on the product URL. `data-asin` on search results is always the canonical ASIN.
-- **`?th=1` redirect**: Amazon appends `?th=1` (and sometimes `?psc=1`) to product URLs after redirect. This is normal — `input[name="ASIN"]` always has the clean ASIN.
-- **Wait 2s after `wait_for_load()`**: Amazon search results load the listing cards asynchronously. `readyState=complete` fires before cards render. A hard 2s wait is required.
|
1
|
+
# Amazon — Product Search & Data Extraction
|
|
2
|
+
|
|
3
|
+
Field-tested against amazon.com on 2025-04-18 using a logged-in Chrome session.
|
|
4
|
+
No CAPTCHA or bot detection was triggered during any test run.
|
|
5
|
+
|
|
6
|
+
## Navigation
|
|
7
|
+
|
|
8
|
+
### Direct search URL (fastest, always use this)
|
|
9
|
+
```python
|
|
10
|
+
goto_url("https://www.amazon.com/s?k=mechanical+keyboard")
|
|
11
|
+
wait_for_load()
|
|
12
|
+
wait(2) # dynamic content needs ~2s after readyState=complete
|
|
13
|
+
```
|
|
14
|
+
|
|
15
|
+
### Search box typing (use when you need category filtering)
|
|
16
|
+
```python
|
|
17
|
+
goto_url("https://www.amazon.com")
|
|
18
|
+
wait_for_load()
|
|
19
|
+
wait(1)
|
|
20
|
+
js("document.querySelector('#twotabsearchtextbox').focus()")
|
|
21
|
+
js("document.querySelector('#twotabsearchtextbox').click()")
|
|
22
|
+
wait(0.3)
|
|
23
|
+
type_text("wireless mouse")
|
|
24
|
+
wait(0.3)
|
|
25
|
+
press_key("Enter")
|
|
26
|
+
wait_for_load()
|
|
27
|
+
wait(2)
|
|
28
|
+
```
|
|
29
|
+
|
|
30
|
+
### Direct product page
|
|
31
|
+
```python
|
|
32
|
+
# URL pattern: /dp/{ASIN} or /dp/{ASIN}?th=1 (Amazon may redirect to add ?th=1)
|
|
33
|
+
goto_url("https://www.amazon.com/dp/B08Z6X4NK3")
|
|
34
|
+
wait_for_load()
|
|
35
|
+
wait(2)
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
## Session Gotcha
|
|
39
|
+
|
|
40
|
+
**Always use `new_tab()` when opening Amazon for the first time in a harness session.**
|
|
41
|
+
`goto_url()` can silently fail to navigate if the current tab resists the navigation
|
|
42
|
+
(observed when the daemon attached to a different real tab). The safe pattern:
|
|
43
|
+
|
|
44
|
+
```python
|
|
45
|
+
tid = new_tab("https://www.amazon.com/s?k=mechanical+keyboard")
|
|
46
|
+
wait_for_load()
|
|
47
|
+
wait(2)
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
After that, `goto_url()` works fine within the same Amazon session.
|
|
51
|
+
|
|
52
|
+
## Search Results Extraction
|
|
53
|
+
|
|
54
|
+
### Container selector
|
|
55
|
+
`[data-component-type="s-search-result"]` — confirmed working, yields ~22 results per page.
|
|
56
|
+
|
|
57
|
+
### Full extraction (field-tested)
|
|
58
|
+
```python
|
|
59
|
+
results = js("""
|
|
60
|
+
Array.from(document.querySelectorAll('[data-component-type="s-search-result"]')).map(el => ({
|
|
61
|
+
asin: el.getAttribute('data-asin'),
|
|
62
|
+
title: el.querySelector('h2 span')?.innerText?.trim(),
|
|
63
|
+
price: el.querySelector('.a-price .a-offscreen')?.innerText,
|
|
64
|
+
list_price: el.querySelector('.a-text-price .a-offscreen')?.innerText,
|
|
65
|
+
rating: el.querySelector('[aria-label*="out of 5 stars"]')?.getAttribute('aria-label')?.split(' ')[0],
|
|
66
|
+
reviews: el.querySelector('[aria-label*="ratings"]')?.getAttribute('aria-label'),
|
|
67
|
+
is_sponsored: !!el.querySelector('.puis-sponsored-label-text'),
|
|
68
|
+
url: el.querySelector('h2 a')?.href
|
|
69
|
+
}))
|
|
70
|
+
""")
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
### Field notes
|
|
74
|
+
- **`asin`**: `data-asin` attribute on the container div — always present, matches the `/dp/{ASIN}` URL.
|
|
75
|
+
- **`title`**: `h2 span` works consistently. `h2 a.a-link-normal span` also works.
|
|
76
|
+
- **`price`**: `.a-price .a-offscreen` returns the formatted string e.g. `"$69.99"`. Use this, not `.a-price-whole`.
|
|
77
|
+
- **`list_price`**: `.a-text-price .a-offscreen` — only present when item is on sale (was/now pricing).
|
|
78
|
+
- **`rating`**: Use `aria-label` on `[aria-label*="out of 5 stars"]` — gives `"4.5 out of 5 stars, rating details"`, split on space for the number.
|
|
79
|
+
- **`reviews`**: Use `[aria-label*="ratings"]` attribute — gives `"1,514 ratings"`. Do NOT use `.a-size-base.s-underline-text` — that element exists on sponsored results and shows "Xbox" (a cross-sell widget text).
|
|
80
|
+
- **`is_sponsored`**: `.puis-sponsored-label-text` is present on sponsored listings; first 2-3 results are usually sponsored.
|
|
81
|
+
- **`url`**: `h2 a` href — contains the full `/dp/{ASIN}/...` URL.

## Product Detail Page Extraction

### Confirmed selectors (field-tested on B08Z6X4NK3)
```python
detail = js("""
({
  title: document.querySelector('#productTitle')?.innerText?.trim(),
  price: (function() {
    // Preferred: join the split whole and fraction spans
    var whole = document.querySelector('.a-price-whole')?.innerText?.replace(/[\\n.]/g,'');
    var frac = document.querySelector('.a-price-fraction')?.innerText;
    // Fallback: first formatted price string on the page
    return (whole && frac) ? '$' + whole + '.' + frac
                           : document.querySelector('.a-price .a-offscreen')?.innerText || null;
  })(),
  list_price: document.querySelector('.basisPrice .a-offscreen')?.innerText,
  rating: document.querySelector('#acrPopover')?.getAttribute('title'),
  review_count: document.querySelector('#acrCustomerReviewText')?.innerText,
  availability: document.querySelector('#availability span')?.innerText?.trim(),
  brand: document.querySelector('#bylineInfo')?.innerText?.trim(),
  asin: document.querySelector('input[name="ASIN"]')?.value,
  bullet_points: Array.from(document.querySelectorAll('#feature-bullets li span.a-list-item'))
                   .map(e => e.innerText?.trim()).filter(t => t)
})
""")
```

### Price field notes
- `#priceblock_ourprice` and `#priceblock_dealprice` are **legacy**: they return `null` on modern product pages.
- Construct the price from `.a-price-whole` + `.a-price-fraction`, stripping `\n` and `.` from the whole part first.
- As a fallback, the first `.a-price .a-offscreen` on the page also works (confirmed: `$69.99`).
- `list_price` from `.basisPrice .a-offscreen` shows the crossed-out "was" price when a discount exists.
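
If the raw element texts are collected first and assembled on the Python side, the same whole-plus-fraction rule with the `.a-offscreen` fallback might look like the sketch below. `build_price` and its parameters are hypothetical names, and the `$` default assumes amazon.com.

```python
import re


def build_price(whole_text, fraction_text, offscreen_text=None, symbol="$"):
    """Join the split price parts; fall back to the formatted string."""
    # .a-price-whole text carries a trailing newline and dot, so strip both.
    whole = re.sub(r"[\n.]", "", whole_text or "").strip()
    fraction = (fraction_text or "").strip()
    if whole and fraction:
        return f"{symbol}{whole}.{fraction}"
    # Neither part usable: use the .a-offscreen string if one was captured.
    return offscreen_text or None
```

For example, `build_price("69\n.", "99")` and `build_price(None, None, "$69.99")` both yield `"$69.99"`.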

## Best Sellers Page

URL: `https://www.amazon.com/Best-Sellers-{Category}/zgbs/{slug}/`
e.g. `https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/`

### DOM structure (2025)
`.zg-item-immersion` **no longer exists**: Amazon migrated to CSS modules. Use `[data-asin]` anchored on `[id="gridItemRoot"]`:

```python
goto_url("https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/")
wait_for_load()
wait(2)

items = js("""
Array.from(document.querySelectorAll('[data-asin]')).map(el => {
  var container = el.closest('[id="gridItemRoot"]') || el;
  return {
    asin: el.getAttribute('data-asin'),
    rank: container.querySelector('[class*="zg-bdg-text"]')?.innerText,
    title: container.querySelector('img[alt]')?.getAttribute('alt'),
    price: container.querySelector('.p13n-sc-price, .a-size-base.a-color-price')?.innerText,
    url: 'https://www.amazon.com/dp/' + el.getAttribute('data-asin')
  }
}).filter(r => r.rank)
""")
```

Note: The title comes from the product image `alt` attribute, because the text title elements use obfuscated CSS module class names that change between deployments.
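
The `items` list can then be tidied on the Python side. The sketch below rests on two assumptions that the notes above do not confirm: that the rank text contains a number (for example `#1`), and that the same ASIN may appear more than once when several `[data-asin]` elements resolve to one grid item. `clean_best_sellers` is a hypothetical helper name.

```python
import re


def clean_best_sellers(items):
    """Convert rank text to int, drop rows without an ASIN, de-duplicate by ASIN."""
    seen = set()
    cleaned = []
    for item in items or []:
        asin = item.get("asin")
        match = re.search(r"\d+", item.get("rank") or "")
        if not asin or not match or asin in seen:
            continue
        seen.add(asin)
        cleaned.append({**item, "rank": int(match.group())})
    # Present the list in rank order regardless of DOM order.
    return sorted(cleaned, key=lambda row: row["rank"])
```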

## Pagination

```python
# Get next page URL directly
next_url = js("document.querySelector('.s-pagination-next')?.href")
if next_url:  # falsy when no next-page link is found
    goto_url(next_url)
    wait_for_load()
    wait(2)

# Or construct by page number
goto_url("https://www.amazon.com/s?k=wireless+mouse&page=2")
```
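
When constructing URLs by page number, building the query string with the standard library avoids hand-escaping the search terms. This is a minimal sketch; `search_url` is a hypothetical helper, and it only uses the `k` and `page` parameters shown above.

```python
from urllib.parse import urlencode


def search_url(query, page=1):
    """Build an Amazon search URL for a query and a 1-based page number."""
    params = {"k": query}
    if page > 1:
        params["page"] = page
    # urlencode encodes spaces as '+', matching the URL form used above.
    return "https://www.amazon.com/s?" + urlencode(params)
```

`search_url("wireless mouse", 2)` produces `https://www.amazon.com/s?k=wireless+mouse&page=2`, which can be passed straight to `goto_url()`.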

## Result Count

```python
count_text = js("document.querySelector('[data-component-type=\"s-result-info-bar\"] h1')?.innerText?.trim()")
# Returns e.g.: '1-16 of over 40,000 results for "wireless mouse"\nSort by:\n...'
# Keep only the summary line: count_text.split('\n')[0]
```
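
To turn that summary line into a number, a small regex is usually enough. The sketch below assumes the "N-M of [over] X results" wording shown in the example; other phrasings return `None`. `parse_result_count` is a hypothetical helper name.

```python
import re


def parse_result_count(count_text):
    """Extract the total from e.g. '1-16 of over 40,000 results for ...'."""
    first_line = (count_text or "").split("\n")[0]
    match = re.search(r"of (over )?([\d,]+) results", first_line)
    if not match:
        return None
    return {
        "total": int(match.group(2).replace(",", "")),
        # "over" means the figure is a lower bound, not an exact count.
        "approximate": bool(match.group(1)),
    }
```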

## CAPTCHA Detection

No CAPTCHA was encountered during testing with a logged-in Chrome session. To detect one defensively:

```python
def check_captcha():
    # Lowercase once so every check below is case-insensitive
    text = (js("document.body.innerText.slice(0,500)") or "").lower()
    url = page_info()["url"].lower()
    return (
        "captcha" in text
        or "enter the characters" in text
        or "sorry, we just need to make sure" in text
        or "captcha" in url  # also matches "validateCaptcha"
    )

if check_captcha():
    raise RuntimeError("Amazon CAPTCHA hit: stop and notify user")
```

Amazon may serve a CAPTCHA on fresh/anonymous sessions. Using the browser's existing logged-in session avoids this in practice.

## Gotchas

- **`goto_url()` silent failure**: On first visit, use `new_tab(url)` instead. After the tab is on Amazon, `goto_url()` works.
- **`.zg-item-immersion` is gone**: The Best Sellers page uses obfuscated CSS module classes. Use `[data-asin]` for the items and `img[alt]` for the title.
- **`.a-size-base.s-underline-text` is unreliable for review count**: On sponsored results it shows unrelated text (e.g. "Xbox"). Use `[aria-label*="ratings"]` instead.
- **`#priceblock_ourprice` is legacy**: Returns `null` on modern pages. Construct the price from `.a-price-whole` + `.a-price-fraction`.
- **Sponsored results appear first**: The first 2-3 results are almost always `is_sponsored: true`. Filter them out with `!el.querySelector('.puis-sponsored-label-text')` when you need organic results.
- **`data-asin` can be an empty string on non-product rows**: Filter with `.filter(r => r.asin)`.
- **Price split DOM**: `.a-price-whole` innerText includes a trailing `\n.`; strip it with `.replace(/[\n.]/g,'')`.
- **ASIN from URL**: Use the `/dp/([A-Z0-9]{10})/` regex on the product URL. A non-empty `data-asin` on search results is always the canonical ASIN.
- **`?th=1` redirect**: Amazon appends `?th=1` (and sometimes `?psc=1`) to product URLs after redirect. This is normal; `input[name="ASIN"]` always has the clean ASIN.
- **Wait 2s after `wait_for_load()`**: Amazon search results load the listing cards asynchronously, so `readyState=complete` fires before the cards render. A hard 2s wait is required.
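
The ASIN-from-URL gotcha can be sketched in Python as follows. One deliberate difference from the pattern quoted above: this version also accepts an ASIN followed by `?`, `#`, or the end of the string, so redirected URLs such as `/dp/B08Z6X4NK3?th=1` still match. `asin_from_url` is a hypothetical helper name.

```python
import re

# Ten uppercase alphanumerics after /dp/, ending at '/', '?', '#', or end of string.
ASIN_RE = re.compile(r"/dp/([A-Z0-9]{10})(?=[/?#]|$)")


def asin_from_url(url):
    """Return the ASIN embedded in a product URL, or None if there is none."""
    match = ASIN_RE.search(url or "")
    return match.group(1) if match else None
```

For instance, `asin_from_url("https://www.amazon.com/dp/B08Z6X4NK3?th=1")` returns `"B08Z6X4NK3"`, while a search URL without a `/dp/` segment returns `None`.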