scrape-do-mcp 0.2.0 → 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README-ZH.md +64 -171
- package/README.md +63 -172
- package/dist/index.js +935 -165
- package/package.json +6 -2
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Abel
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README-ZH.md
CHANGED
@@ -2,25 +2,44 @@

 [English Docs](./README.md) | 中文文档

-Scrape.do
-
-
-
-
-
-
-
-
-
-
-
-
+这是一个把 Scrape.do 官方文档中主要 API 能力封装成 MCP 工具的包:主抓取 API、Google Search API、Amazon Scraper API、Async API,以及 Proxy Mode 配置辅助工具。
+
+官方文档:https://scrape.do/documentation/
+
+## 覆盖范围
+
+- `scrape_url`:主 Scrape.do 抓取 API,支持 JS 渲染、地理定位、会话保持、截图、ReturnJSON、浏览器交互、Cookie、Header 转发。
+- `google_search`:结构化 Google 搜索 API,支持 `google_domain`、`location`、`uule`、`lr`、`cr`、`safe`、`nfpr`、`filter`、分页、原始 HTML。
+- `amazon_product`:Amazon PDP 接口。
+- `amazon_offer_listing`:Amazon 卖家报价接口。
+- `amazon_search`:Amazon 搜索 / 类目结果接口。
+- `amazon_raw_html`:Amazon 原始 HTML 接口。
+- `async_create_job`、`async_get_job`、`async_get_task`、`async_list_jobs`、`async_cancel_job`、`async_get_account`:Async API,并同时兼容 MCP 风格字段和官方字段名。
+- `proxy_mode_config`:生成更贴近官方文档的 Proxy Mode 连接信息、默认参数串和证书信息。
+
+## 兼容性说明
+
+- `scrape_url` 同时支持 MCP 友好的别名和官方参数名:
+  - `render_js` 或 `render`
+  - `super_proxy` 或 `super`
+  - `screenshot` 或 `screenShot`
+- `google_search` 同时支持:
+  - `query` 或 `q`
+  - `country` 或 `gl`
+  - `language` 或 `hl`
+  - `domain` 或 `google_domain`
+  - `includeHtml` 或 `include_html`
+- `async_create_job` 同时接受 `targets`、`render`、`webhookUrl` 这类别名,以及官方字段 `Targets`、`Render`、`WebhookURL`。
+- `async_get_job`、`async_get_task`、`async_cancel_job` 同时接受 `jobId` / `taskId` 和官方 `jobID` / `taskID`。
+- `async_list_jobs` 同时支持 `pageSize` 和官方 `page_size`。
+- `scrape_url` 里的 Header 转发请使用 `headers` + `header_mode`(`custom` / `extra` / `forward`)。
+- 截图结果会保留官方 JSON 响应,同时附加 MCP 图片内容,尽量兼顾官方格式和 MCP 可视化体验。
+- `scrape_url` 现在默认使用 `output="raw"`,更贴近官方 API。
+- `scrape_url` 会在 `structuredContent` 里附带响应元数据,便于在 MCP 中查看 `pureCookies`、`transparentResponse` 和二进制响应信息。

 ## 安装

-###
-
-在终端中运行以下命令:
+### 快速安装

 ```bash
 claude mcp add-json scrape-do --scope user '{
@@ -28,13 +47,11 @@ claude mcp add-json scrape-do --scope user '{
   "command": "npx",
   "args": ["-y", "scrape-do-mcp"],
   "env": {
-    "SCRAPE_DO_TOKEN": "
+    "SCRAPE_DO_TOKEN": "YOUR_TOKEN_HERE"
   }
 }'
 ```

-将 `你的Token` 替换为你在 https://app.scrape.do 获取的 API Token。
-
 ### Claude Desktop

 添加到 `~/.claude.json`:
@@ -46,181 +63,57 @@ claude mcp add-json scrape-do --scope user '{
       "command": "npx",
       "args": ["-y", "scrape-do-mcp"],
       "env": {
-        "SCRAPE_DO_TOKEN": "
+        "SCRAPE_DO_TOKEN": "YOUR_TOKEN_HERE"
       }
     }
   }
 }
 ```

-
-
-## 使用方法
-
-### scrape_url
-
-抓取任意网页并获取 Markdown 内容。
-
-```typescript
-// 完整参数
-{
-  // 必需
-  url: string, // 要抓取的网址
-
-  // 代理和渲染
-  render_js?: boolean, // 渲染 JavaScript(默认 false)
-  super_proxy?: boolean, // 使用住宅/移动代理(消耗 10 积分)
-  geoCode?: string, // 国家代码(如 'us', 'cn', 'gb')
-  regionalGeoCode?: string, // 区域(如 'asia', 'europe')
-  device?: "desktop" | "mobile" | "tablet", // 设备类型
-  sessionId?: number, // 保持相同 IP 的会话
-
-  // 超时和重试
-  timeout?: number, // 最大超时时间(毫秒,默认 60000)
-  retryTimeout?: number, // 重试超时(毫秒)
-  disableRetry?: boolean, // 禁用自动重试
-
-  // 输出格式
-  output?: "markdown" | "raw", // 输出格式(默认 markdown)
-  returnJSON?: boolean, // 以 JSON 形式返回网络请求
-  transparentResponse?: boolean, // 返回原始响应
-
-  // 截图
-  screenshot?: boolean, // 截图(PNG)
-  fullScreenShot?: boolean, // 全页截图
-  particularScreenShot?: string, // 元素截图(CSS 选择器)
-
-  // 浏览器控制
-  waitSelector?: string, // 等待元素(CSS 选择器)
-  customWait?: number, // 加载后等待时间(毫秒)
-  waitUntil?: "domcontentloaded" | "load" | "networkidle" | "networkidle0" | "networkidle2",
-  width?: number, // 视口宽度(默认 1920)
-  height?: number, // 视口高度(默认 1080)
-  blockResources?: boolean, // 阻止 CSS/图片/字体(默认 true)
-
-  // 请求头和 Cookie
-  customHeaders?: boolean, // 处理所有请求头
-  extraHeaders?: boolean, // 添加额外请求头
-  forwardHeaders?: boolean, // 转发你的请求头
-  setCookies?: string, // 设置 Cookie(格式:'name=value; name2=value2')
-  pureCookies?: boolean, // 返回原始 Cookie
-
-  // 其他
-  disableRedirection?: boolean, // 禁用重定向
-  callback?: string // Webhook URL 异步接收结果
-}
-```
-
-### google_search
-
-搜索 Google 并获取结构化结果。
-
-```typescript
-// 完整参数
-{
-  // 必需
-  query: string, // 搜索关键词
-
-  // 搜索选项
-  country?: string, // 国家代码(默认 'us')
-  language?: string, // 界面语言(默认 'en')
-  domain?: string, // Google 域名(如 'com', 'co.uk')
-  page?: number, // 页码(默认 1)
-  num?: number, // 每页结果数(默认 10)
-  time_period?: "" | "last_hour" | "last_day" | "last_week" | "last_month" | "last_year",
-  device?: "desktop" | "mobile", // 设备类型
-
-  // 高级
-  includeHtml?: boolean // 在响应中包含原始 HTML
-}
-```
-
-## 使用示例
-
-### 抓取网页
-```
-请抓取 https://github.com 并给我主要内容(Markdown 格式)。
-```
-
-### Google 搜索
-```
-搜索 "2026 年最佳 Python Web 框架",返回前 5 个结果。
-```
+Token 获取地址:https://app.scrape.do

-
-```
-用中文搜索 "AI 新闻",限定为中国,过去一周的内容。
-```
+## 可用工具

-
-
-
-
-
+| 工具 | 用途 |
+|------|------|
+| `scrape_url` | 主 Scrape.do 抓取 API |
+| `google_search` | 结构化 Google 搜索结果 |
+| `amazon_product` | Amazon PDP 结构化数据 |
+| `amazon_offer_listing` | Amazon 全量卖家报价 |
+| `amazon_search` | Amazon 搜索 / 类目结果 |
+| `amazon_raw_html` | Amazon 原始 HTML |
+| `async_create_job` | 创建 Async API 任务 |
+| `async_get_job` | 查询 Async job 详情 |
+| `async_get_task` | 查询 Async task 详情 |
+| `async_list_jobs` | 列出 Async jobs |
+| `async_cancel_job` | 取消 Async job |
+| `async_get_account` | 查询 Async 账户 / 并发信息 |
+| `proxy_mode_config` | 生成 Proxy Mode 配置 |

-
-```
-抓取 https://example.com 并返回原始 HTML 而不是 markdown。
-```
+## 示例提示词

-
-
-用日本(geoCode: jp)的 IP 抓取 https://www.amazon.com/product/12345
+```text
+抓取 https://example.com,开启 render=true,并等待 #app 出现。
 ```

-
-
-用移动设备抓取 https://example.com 来查看移动版页面。
+```text
+搜索 "open source MCP servers",并设置 google_domain=google.co.uk 与 lr=lang_en。
 ```

-
-
-截取 https://example.com 的屏幕截图并返回图片。
-```
-
-### 等待元素加载
-```
-抓取 https://example.com 但先等待 id 为 "content" 的元素加载完成。
+```text
+获取 Amazon ASIN B0C7BKZ883 在美国 zipcode=10001 下的 PDP 数据。
 ```

-
-
-使用会话 ID 12345 抓取 https://example.com 的多个页面,以保持相同的 IP。
+```text
+帮我为这 20 个 URL 创建一个异步抓取任务,并返回 job ID。
 ```

-## 与其他工具对比
-
-| 功能 | scrape-do-mcp | Firecrawl | Browserbase |
-|------|--------------|-----------|-------------|
-| Google 搜索 | ✅ | ❌ | ❌ |
-| 免费积分 | 1,000 | 500 | 无 |
-| 价格 | 按量付费 | $19+/月 | $15+/月 |
-| MCP 原生 | ✅ | ✅ | ❌ |
-| 配置难度 | 无需配置 | 需要 API key | 需要 API key + 浏览器 |
-
-### 为什么选择 scrape-do-mcp?
-
-- **零配置**:获取 Token 后即可立即使用
-- **一体化**:网页抓取和 Google 搜索集于一个 MCP
-- **反爬虫绕过**:自动处理 Cloudflare、WAF、CAPTCHA
-- **成本效益**:按需付费,免费额度可用
-
-## 积分消耗
-
-| 工具 | 积分消耗 |
-|------|---------|
-| scrape_url(普通) | 1 积分/次 |
-| scrape_url(super_proxy) | 10 积分/次 |
-| google_search | 1 积分/次 |
-
-**免费:每月 1,000 积分** - 无需信用卡:https://app.scrape.do
-
 ## 开发

 ```bash
 npm install
 npm run build
-npm run dev
+npm run dev
 ```

 ## 许可证
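The alias handling both READMEs describe for `scrape_url` (`render_js` vs `render`, `super_proxy` vs `super`, `screenshot` vs `screenShot`) can be made concrete with a short sketch. This is a hypothetical helper, not code from the package: the `buildScrapeUrl` name is invented for illustration, and it assumes Scrape.do's query-string endpoint at `https://api.scrape.do/` as described in the official documentation.

```typescript
// Hypothetical helper (not from scrape-do-mcp): rewrites the MCP-friendly
// aliases the README lists into official Scrape.do parameter names, then
// builds a query-string request URL.
const PARAM_ALIASES: Record<string, string> = {
  render_js: "render",      // render_js   -> render
  super_proxy: "super",     // super_proxy -> super
  screenshot: "screenShot", // screenshot  -> screenShot
};

function buildScrapeUrl(
  token: string,
  args: Record<string, string | number | boolean>,
): string {
  const params = new URLSearchParams({ token });
  for (const [key, value] of Object.entries(args)) {
    // Official names pass through unchanged; aliases are rewritten.
    params.set(PARAM_ALIASES[key] ?? key, String(value));
  }
  return `https://api.scrape.do/?${params.toString()}`;
}

console.log(
  buildScrapeUrl("YOUR_TOKEN_HERE", {
    url: "https://example.com",
    render_js: true,
    waitSelector: "#app",
  }),
);
```

The server in `package/dist/index.js` presumably performs an equivalent mapping before calling the API; the point of the sketch is only that either spelling reaches Scrape.do under its official name.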
package/README.md
CHANGED
@@ -2,25 +2,44 @@

 [中文文档](./README-ZH.md) | English

-MCP
-
-
-
-
-
-
-
-
-
-
-
-
+An MCP server that wraps Scrape.do's documented APIs in one package: the main scraping API, Google Search API, Amazon Scraper API, Async API, and a Proxy Mode configuration helper.
+
+Official docs: https://scrape.do/documentation/
+
+## Coverage
+
+- `scrape_url`: Main Scrape.do API with JS rendering, geo-targeting, session persistence, screenshots, ReturnJSON, browser interactions, cookies, and header forwarding.
+- `google_search`: Structured Google SERP API with `google_domain`, `location`, `uule`, `lr`, `cr`, `safe`, `nfpr`, `filter`, pagination, and optional raw HTML.
+- `amazon_product`: Amazon PDP endpoint.
+- `amazon_offer_listing`: Amazon offer listing endpoint.
+- `amazon_search`: Amazon search/category endpoint.
+- `amazon_raw_html`: Raw HTML Amazon endpoint with geo-targeting.
+- `async_create_job`, `async_get_job`, `async_get_task`, `async_list_jobs`, `async_cancel_job`, `async_get_account`: Async API coverage with both MCP-friendly aliases and official field names.
+- `proxy_mode_config`: Builds official Proxy Mode connection details, default parameter strings, and CA certificate references.
+
+## Compatibility Notes
+
+- `scrape_url` supports both MCP-friendly aliases and official parameter names:
+  - `render_js` or `render`
+  - `super_proxy` or `super`
+  - `screenshot` or `screenShot`
+- `google_search` supports:
+  - `query` or `q`
+  - `country` or `gl`
+  - `language` or `hl`
+  - `domain` or `google_domain`
+  - `includeHtml` or `include_html`
+- `async_create_job` accepts both alias fields like `targets`, `render`, `webhookUrl` and official Async API fields like `Targets`, `Render`, `WebhookURL`.
+- `async_get_job`, `async_get_task`, and `async_cancel_job` accept both `jobId`/`taskId` and official `jobID`/`taskID`.
+- `async_list_jobs` accepts both `pageSize` and official `page_size`.
+- For header forwarding in `scrape_url`, pass `headers` plus `header_mode` (`custom`, `extra`, or `forward`).
+- Screenshot responses preserve the official Scrape.do JSON body and also attach MCP image content when screenshots are present.
+- `scrape_url` now defaults to `output="raw"` to match the official API more closely.
+- `scrape_url` includes response metadata in `structuredContent`, which helps surface `pureCookies`, `transparentResponse`, and binary responses inside MCP.

 ## Installation

-### Quick Install
-
-Run this command in your terminal:
+### Quick Install

 ```bash
 claude mcp add-json scrape-do --scope user '{
@@ -33,11 +52,9 @@ claude mcp add-json scrape-do --scope user '{
 }'
 ```

-Replace `YOUR_TOKEN_HERE` with your Scrape.do API token from https://app.scrape.do
-
 ### Claude Desktop

-Add to your `~/.claude.json`:
+Add this to `~/.claude.json`:

 ```json
 {
@@ -46,183 +63,57 @@ Add to your `~/.claude.json`:
       "command": "npx",
       "args": ["-y", "scrape-do-mcp"],
       "env": {
-        "SCRAPE_DO_TOKEN": "
+        "SCRAPE_DO_TOKEN": "YOUR_TOKEN_HERE"
       }
     }
   }
 }
 ```

-Get your
+Get your token at https://app.scrape.do

-##
-
-### scrape_url
-
-Scrape any webpage and get content as Markdown.
-
-```typescript
-// All Parameters
-{
-  // Required
-  url: string, // Target URL to scrape
-
-  // Proxy & Rendering
-  render_js?: boolean, // Render JavaScript (default: false)
-  super_proxy?: boolean, // Use residential/mobile proxies (costs 10 credits)
-  geoCode?: string, // Country code (e.g., 'us', 'cn', 'gb')
-  regionalGeoCode?: string, // Region (e.g., 'asia', 'europe')
-  device?: "desktop" | "mobile" | "tablet", // Device type
-  sessionId?: number, // Keep same IP for session
-
-  // Timeout & Retry
-  timeout?: number, // Max timeout in ms (default: 60000)
-  retryTimeout?: number, // Retry timeout in ms
-  disableRetry?: boolean, // Disable auto retry
-
-  // Output Format
-  output?: "markdown" | "raw", // Output format (default: markdown)
-  returnJSON?: boolean, // Return network requests as JSON
-  transparentResponse?: boolean, // Return pure response
-
-  // Screenshot
-  screenshot?: boolean, // Take screenshot (PNG)
-  fullScreenShot?: boolean, // Full page screenshot
-  particularScreenShot?: string, // Screenshot of element (CSS selector)
-
-  // Browser Control
-  waitSelector?: string, // Wait for element (CSS selector)
-  customWait?: number, // Wait time after load (ms)
-  waitUntil?: "domcontentloaded" | "load" | "networkidle" | "networkidle0" | "networkidle2",
-  width?: number, // Viewport width (default: 1920)
-  height?: number, // Viewport height (default: 1080)
-  blockResources?: boolean, // Block CSS/images/fonts (default: true)
-
-  // Headers & Cookies
-  customHeaders?: boolean, // Handle all headers
-  extraHeaders?: boolean, // Add extra headers
-  forwardHeaders?: boolean, // Forward your headers
-  setCookies?: string, // Set cookies ('name=value; name2=value2')
-  pureCookies?: boolean, // Return original cookies
-
-  // Other
-  disableRedirection?: boolean, // Disable redirect
-  callback?: string // Webhook URL for async results
-}
-```
-
-### google_search
-
-Search Google and get structured results.
+## Available Tools

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-  // Advanced
-  includeHtml?: boolean // Include raw HTML in response
-}
-```
+| Tool | Purpose |
+|------|---------|
+| `scrape_url` | Main Scrape.do scraping API wrapper |
+| `google_search` | Structured Google search results |
+| `amazon_product` | Amazon PDP structured data |
+| `amazon_offer_listing` | Amazon seller offers |
+| `amazon_search` | Amazon keyword/category results |
+| `amazon_raw_html` | Raw Amazon HTML with geo-targeting |
+| `async_create_job` | Create Async API jobs |
+| `async_get_job` | Fetch Async job details |
+| `async_get_task` | Fetch Async task details |
+| `async_list_jobs` | List Async jobs |
+| `async_cancel_job` | Cancel Async jobs |
+| `async_get_account` | Fetch Async account/concurrency info |
+| `proxy_mode_config` | Generate Proxy Mode configuration |

 ## Example Prompts

-
-
-### Scrape a Website
-```
-Please scrape https://github.com and give me the main content as markdown.
+```text
+Scrape https://example.com with render=true and wait for #app.
 ```

-
-
-Search Google for "best Python web frameworks 2026" and return the top 5 results.
+```text
+Search Google for "open source MCP servers" with google_domain=google.co.uk and lr=lang_en.
 ```

-
-
-Search for "AI news" in Chinese, from China, last week.
+```text
+Get the Amazon PDP for ASIN B0C7BKZ883 in the US with zipcode 10001.
 ```

-
-
-Scrape this React Single Page Application: https://example-spa.com
-Use render_js=true to get the fully rendered content.
+```text
+Create an async job for these 20 URLs and give me the job ID.
 ```

-### Get Raw HTML
-```
-Scrape https://example.com and return raw HTML instead of markdown.
-```
-
-### Geo-targeting
-```
-Scrape https://www.amazon.com/product/12345 as if I'm in Japan (geoCode: jp)
-```
-
-### Mobile Device
-```
-Scrape https://example.com using a mobile device to see the mobile version.
-```
-
-### Take Screenshot
-```
-Take a screenshot of https://example.com and return the image.
-```
-
-### Wait for Element
-```
-Scrape https://example.com but wait for the element with id "content" to load first.
-```
-
-### Session Persistence
-```
-Scrape multiple pages of https://example.com using sessionId 12345 to maintain the same IP.
-```
-
-## Comparison with Alternatives
-
-| Feature | scrape-do-mcp | Firecrawl | Browserbase |
-|---------|--------------|-----------|-------------|
-| Google Search | ✅ | ❌ | ❌ |
-| Free Credits | 1,000 | 500 | None |
-| Pricing | Pay per use | $19+/mo | $15+/mo |
-| MCP Native | ✅ | ✅ | ❌ |
-| Setup Required | None | API key | API key + browser |
-
-### Why scrape-do-mcp?
-
-- **Zero setup**: Just get a token and use immediately
-- **All-in-one**: Both web scraping AND Google search in one MCP
-- **Anti-bot bypass**: Automatically handles Cloudflare, WAFs, CAPTCHAs
-- **Cost-effective**: Pay only for what you use, free tier available
-
-## Credit Usage
-
-| Tool | Credit Cost |
-|------|-------------|
-| scrape_url (regular) | 1 credit/request |
-| scrape_url (super_proxy) | 10 credits/request |
-| google_search | 1 credit/request |
-
-**Free: 1,000 credits/month** - No credit card required: https://app.scrape.do
-
 ## Development

 ```bash
 npm install
 npm run build
-npm run dev
+npm run dev
 ```

 ## License
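The dual field-name handling the README describes for the Async tools (alias `targets`/`render`/`webhookUrl`/`pageSize` vs official `Targets`/`Render`/`WebhookURL`/`page_size`, and `jobId`/`taskId` vs `jobID`/`taskID`) can be sketched as a small normalizer. This is a hypothetical illustration only: the `toOfficialFields` helper and its precedence rule are assumptions, not the package's actual implementation.

```typescript
// Hypothetical normalizer (not from scrape-do-mcp): rewrites the alias
// fields the README lists into the official Async API spellings.
const FIELD_MAP: Record<string, string> = {
  targets: "Targets",
  render: "Render",
  webhookUrl: "WebhookURL",
  jobId: "jobID",
  taskId: "taskID",
  pageSize: "page_size",
};

function toOfficialFields(
  input: Record<string, unknown>,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    const official = FIELD_MAP[key] ?? key;
    // If both spellings were supplied, let the official one win.
    if (!(official in out) || official === key) {
      out[official] = value;
    }
  }
  return out;
}

console.log(
  toOfficialFields({
    targets: ["https://example.com"],
    render: true,
    webhookUrl: "https://hooks.example.invalid/cb",
  }),
);
```

A normalizer like this keeps the MCP tool schemas friendly to model-generated arguments while still emitting the exact field names the official Async API expects.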