scrape-do-mcp 0.2.0 → 0.3.0
This diff shows the content of publicly released package versions as they appear in their respective public registries, and is provided for informational purposes only.
- package/LICENSE +21 -0
- package/README-ZH.md +60 -171
- package/README.md +59 -172
- package/dist/index.js +772 -167
- package/package.json +6 -2
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Abel
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README-ZH.md
CHANGED
@@ -2,25 +2,40 @@

 [English Docs](./README.md) | 中文文档

-Scrape.do
-
-
-
-
-
-
-
-
-
-
-
-
+This package wraps the main API capabilities from the official Scrape.do documentation as MCP tools: the main scraping API, Google Search API, Amazon Scraper API, Async API, and a Proxy Mode configuration helper.
+
+Official docs: https://scrape.do/documentation/
+
+## Coverage
+
+- `scrape_url`: the main Scrape.do scraping API, with JS rendering, geo-targeting, session persistence, screenshots, ReturnJSON, browser interactions, cookies, and header forwarding.
+- `google_search`: structured Google search API, supporting `google_domain`, `location`, `uule`, `lr`, `cr`, `safe`, `nfpr`, `filter`, pagination, and raw HTML.
+- `amazon_product`: Amazon PDP endpoint.
+- `amazon_offer_listing`: Amazon seller offers endpoint.
+- `amazon_search`: Amazon search/category results endpoint.
+- `amazon_raw_html`: Amazon raw HTML endpoint.
+- `async_create_job`, `async_get_job`, `async_get_task`, `async_list_jobs`, `async_cancel_job`, `async_get_account`: Async API.
+- `proxy_mode_config`: generates Proxy Mode connection details and parameter strings without leaking your token in tool output.
+
+## Compatibility Notes
+
+- `scrape_url` supports both MCP-friendly aliases and the official parameter names:
+  - `render_js` or `render`
+  - `super_proxy` or `super`
+  - `screenshot` or `screenShot`
+- `google_search` likewise supports:
+  - `query` or `q`
+  - `country` or `gl`
+  - `language` or `hl`
+  - `domain` or `google_domain`
+  - `includeHtml` or `include_html`
+- For header forwarding in `scrape_url`, use `headers` plus `header_mode` (`custom` / `extra` / `forward`).
+- Screenshots are returned as MCP image content rather than plain base64 text.
+- When ReturnJSON is not enabled, `scrape_url` defaults to `output="markdown"`, which is easier for LLMs to read; set `output="raw"` if you want behavior closer to the raw HTTP API.

 ## Installation

-###
-
-Run the following command in your terminal:
+### Quick Install

 ```bash
 claude mcp add-json scrape-do --scope user '{
@@ -28,13 +43,11 @@ claude mcp add-json scrape-do --scope user '{
 "command": "npx",
 "args": ["-y", "scrape-do-mcp"],
 "env": {
-"SCRAPE_DO_TOKEN": "
+"SCRAPE_DO_TOKEN": "YOUR_TOKEN_HERE"
 }
 }'
 ```

-Replace `你的Token` with the API token you obtained from https://app.scrape.do.
-
 ### Claude Desktop

 Add to `~/.claude.json`:
@@ -46,181 +59,57 @@ claude mcp add-json scrape-do --scope user '{
 "command": "npx",
 "args": ["-y", "scrape-do-mcp"],
 "env": {
-"SCRAPE_DO_TOKEN": "
+"SCRAPE_DO_TOKEN": "YOUR_TOKEN_HERE"
 }
 }
 }
 }
 ```

-
-
-## Usage
-
-### scrape_url
-
-Scrape any webpage and get its content as Markdown.
-
-```typescript
-// All parameters
-{
-  // Required
-  url: string,                  // Target URL to scrape
-
-  // Proxy & rendering
-  render_js?: boolean,          // Render JavaScript (default: false)
-  super_proxy?: boolean,        // Use residential/mobile proxies (costs 10 credits)
-  geoCode?: string,             // Country code (e.g. 'us', 'cn', 'gb')
-  regionalGeoCode?: string,     // Region (e.g. 'asia', 'europe')
-  device?: "desktop" | "mobile" | "tablet", // Device type
-  sessionId?: number,           // Keep the same IP across a session
-
-  // Timeout & retry
-  timeout?: number,             // Max timeout in ms (default: 60000)
-  retryTimeout?: number,        // Retry timeout in ms
-  disableRetry?: boolean,       // Disable auto retry
-
-  // Output format
-  output?: "markdown" | "raw",  // Output format (default: markdown)
-  returnJSON?: boolean,         // Return network requests as JSON
-  transparentResponse?: boolean, // Return the raw response
-
-  // Screenshot
-  screenshot?: boolean,         // Take a screenshot (PNG)
-  fullScreenShot?: boolean,     // Full-page screenshot
-  particularScreenShot?: string, // Screenshot of an element (CSS selector)
-
-  // Browser control
-  waitSelector?: string,        // Wait for an element (CSS selector)
-  customWait?: number,          // Wait time after load (ms)
-  waitUntil?: "domcontentloaded" | "load" | "networkidle" | "networkidle0" | "networkidle2",
-  width?: number,               // Viewport width (default: 1920)
-  height?: number,              // Viewport height (default: 1080)
-  blockResources?: boolean,     // Block CSS/images/fonts (default: true)
-
-  // Headers & cookies
-  customHeaders?: boolean,      // Handle all headers
-  extraHeaders?: boolean,       // Add extra headers
-  forwardHeaders?: boolean,     // Forward your headers
-  setCookies?: string,          // Set cookies ('name=value; name2=value2')
-  pureCookies?: boolean,        // Return original cookies
-
-  // Other
-  disableRedirection?: boolean, // Disable redirects
-  callback?: string             // Webhook URL for async results
-}
-```
-
-### google_search
-
-Search Google and get structured results.
-
-```typescript
-// All parameters
-{
-  // Required
-  query: string,                // Search keywords
-
-  // Search options
-  country?: string,             // Country code (default: 'us')
-  language?: string,            // UI language (default: 'en')
-  domain?: string,              // Google domain (e.g. 'com', 'co.uk')
-  page?: number,                // Page number (default: 1)
-  num?: number,                 // Results per page (default: 10)
-  time_period?: "" | "last_hour" | "last_day" | "last_week" | "last_month" | "last_year",
-  device?: "desktop" | "mobile", // Device type
-
-  // Advanced
-  includeHtml?: boolean         // Include raw HTML in the response
-}
-```
-
-## Usage Examples
-
-### Scrape a Webpage
-```
-Scrape https://github.com and give me the main content as Markdown.
-```
-
-### Google Search
-```
-Search Google for "best Python web frameworks 2026" and return the top 5 results.
-```
+Get your token at: https://app.scrape.do

-
-```
-Search for "AI news" in Chinese, restricted to China, from the last week.
-```
+## Available Tools

-
-
-
-
-
+| Tool | Purpose |
+|------|---------|
+| `scrape_url` | Main Scrape.do scraping API |
+| `google_search` | Structured Google search results |
+| `amazon_product` | Amazon PDP structured data |
+| `amazon_offer_listing` | All Amazon seller offers |
+| `amazon_search` | Amazon search/category results |
+| `amazon_raw_html` | Raw Amazon HTML |
+| `async_create_job` | Create Async API jobs |
+| `async_get_job` | Fetch Async job details |
+| `async_get_task` | Fetch Async task details |
+| `async_list_jobs` | List Async jobs |
+| `async_cancel_job` | Cancel an Async job |
+| `async_get_account` | Fetch Async account/concurrency info |
+| `proxy_mode_config` | Generate Proxy Mode configuration |

-
-```
-Scrape https://example.com and return raw HTML instead of markdown.
-```
+## Example Prompts

-
-
-Scrape https://www.amazon.com/product/12345 from a Japanese IP (geoCode: jp)
+```text
+Scrape https://example.com with render=true and wait for #app to appear.
 ```

-
-
-Scrape https://example.com with a mobile device to see the mobile version.
+```text
+Search "open source MCP servers" with google_domain=google.co.uk and lr=lang_en.
 ```

-
-
-Take a screenshot of https://example.com and return the image.
-```
-
-### Wait for an Element
-```
-Scrape https://example.com but wait for the element with id "content" to load first.
+```text
+Get the Amazon PDP data for ASIN B0C7BKZ883 in the US with zipcode=10001.
 ```

-
-
-Scrape multiple pages of https://example.com with session ID 12345 to keep the same IP.
+```text
+Create an async scraping job for these 20 URLs and return the job ID.
 ```

-## Comparison with Other Tools
-
-| Feature | scrape-do-mcp | Firecrawl | Browserbase |
-|------|--------------|-----------|-------------|
-| Google Search | ✅ | ❌ | ❌ |
-| Free credits | 1,000 | 500 | None |
-| Pricing | Pay per use | $19+/mo | $15+/mo |
-| MCP native | ✅ | ✅ | ❌ |
-| Setup required | None | API key | API key + browser |
-
-### Why scrape-do-mcp?
-
-- **Zero setup**: grab a token and use it immediately
-- **All-in-one**: web scraping and Google search in a single MCP
-- **Anti-bot bypass**: handles Cloudflare, WAFs, and CAPTCHAs automatically
-- **Cost-effective**: pay as you go, with a free tier
-
-## Credit Usage
-
-| Tool | Credit cost |
-|------|---------|
-| scrape_url (regular) | 1 credit/request |
-| scrape_url (super_proxy) | 10 credits/request |
-| google_search | 1 credit/request |
-
-**Free: 1,000 credits/month** - no credit card required: https://app.scrape.do
-
 ## Development

 ```bash
 npm install
 npm run build
-npm run dev
+npm run dev
 ```

 ## License
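The `proxy_mode_config` tool described in the README diffs above only builds connection details; it performs no request. As a purely illustrative sketch (the `proxy.scrape.do:8080` host and port are an assumption taken from Scrape.do's public Proxy Mode docs, and this is not the package's actual implementation), such a helper could look like:

```typescript
// Hypothetical Proxy Mode helper. Scrape.do's Proxy Mode encodes request
// options in the proxy credentials; host/port here are assumptions from the
// official docs, not values read from this package.
function buildProxyModeConfig(params: Record<string, string>): {
  host: string;
  port: number;
  paramString: string;
  urlTemplate: string;
} {
  // Join options such as render/geoCode into "render=true&geoCode=us".
  const paramString = Object.entries(params)
    .map(([key, value]) => `${key}=${value}`)
    .join("&");
  return {
    host: "proxy.scrape.do",
    port: 8080,
    paramString,
    // The token stays a placeholder, matching the README's note that the
    // tool never emits your real token in its output.
    urlTemplate: `http://YOUR_TOKEN:${paramString}@proxy.scrape.do:8080`,
  };
}
```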
package/README.md
CHANGED
@@ -2,25 +2,40 @@

 [中文文档](./README-ZH.md) | English

-MCP
-
-
-
-
-
-
-
-
-
-
-
-
+An MCP server that wraps Scrape.do's documented APIs in one package: the main scraping API, Google Search API, Amazon Scraper API, Async API, and a Proxy Mode configuration helper.
+
+Official docs: https://scrape.do/documentation/
+
+## Coverage
+
+- `scrape_url`: Main Scrape.do API with JS rendering, geo-targeting, session persistence, screenshots, ReturnJSON, browser interactions, cookies, and header forwarding.
+- `google_search`: Structured Google SERP API with `google_domain`, `location`, `uule`, `lr`, `cr`, `safe`, `nfpr`, `filter`, pagination, and optional raw HTML.
+- `amazon_product`: Amazon PDP endpoint.
+- `amazon_offer_listing`: Amazon offer listing endpoint.
+- `amazon_search`: Amazon search/category endpoint.
+- `amazon_raw_html`: Raw HTML Amazon endpoint with geo-targeting.
+- `async_create_job`, `async_get_job`, `async_get_task`, `async_list_jobs`, `async_cancel_job`, `async_get_account`: Async API coverage.
+- `proxy_mode_config`: Builds Proxy Mode connection details and parameter strings without exposing your token in tool output.
+
+## Compatibility Notes
+
+- `scrape_url` supports both MCP-friendly aliases and official parameter names:
+  - `render_js` or `render`
+  - `super_proxy` or `super`
+  - `screenshot` or `screenShot`
+- `google_search` supports:
+  - `query` or `q`
+  - `country` or `gl`
+  - `language` or `hl`
+  - `domain` or `google_domain`
+  - `includeHtml` or `include_html`
+- For header forwarding in `scrape_url`, pass `headers` plus `header_mode` (`custom`, `extra`, or `forward`).
+- Screenshot responses are returned as MCP image content instead of plain base64 text.
+- `scrape_url` defaults to `output="markdown"` when ReturnJSON is not used so the tool stays LLM-friendly. Set `output="raw"` if you want the raw API-style output.

 ## Installation

-### Quick Install
-
-Run this command in your terminal:
+### Quick Install

 ```bash
 claude mcp add-json scrape-do --scope user '{
@@ -33,11 +48,9 @@ claude mcp add-json scrape-do --scope user '{
 }'
 ```

-Replace `YOUR_TOKEN_HERE` with your Scrape.do API token from https://app.scrape.do
-
 ### Claude Desktop

-Add to
+Add this to `~/.claude.json`:

 ```json
 {
@@ -46,183 +59,57 @@ Add to your `~/.claude.json`:
 "command": "npx",
 "args": ["-y", "scrape-do-mcp"],
 "env": {
-"SCRAPE_DO_TOKEN": "
+"SCRAPE_DO_TOKEN": "YOUR_TOKEN_HERE"
 }
 }
 }
 }
 ```

-Get your
+Get your token at https://app.scrape.do

-##
-
-### scrape_url
-
-Scrape any webpage and get content as Markdown.
-
-```typescript
-// All Parameters
-{
-  // Required
-  url: string,                  // Target URL to scrape
-
-  // Proxy & Rendering
-  render_js?: boolean,          // Render JavaScript (default: false)
-  super_proxy?: boolean,        // Use residential/mobile proxies (costs 10 credits)
-  geoCode?: string,             // Country code (e.g., 'us', 'cn', 'gb')
-  regionalGeoCode?: string,     // Region (e.g., 'asia', 'europe')
-  device?: "desktop" | "mobile" | "tablet", // Device type
-  sessionId?: number,           // Keep same IP for session
-
-  // Timeout & Retry
-  timeout?: number,             // Max timeout in ms (default: 60000)
-  retryTimeout?: number,        // Retry timeout in ms
-  disableRetry?: boolean,       // Disable auto retry
-
-  // Output Format
-  output?: "markdown" | "raw",  // Output format (default: markdown)
-  returnJSON?: boolean,         // Return network requests as JSON
-  transparentResponse?: boolean, // Return pure response
-
-  // Screenshot
-  screenshot?: boolean,         // Take screenshot (PNG)
-  fullScreenShot?: boolean,     // Full page screenshot
-  particularScreenShot?: string, // Screenshot of element (CSS selector)
-
-  // Browser Control
-  waitSelector?: string,        // Wait for element (CSS selector)
-  customWait?: number,          // Wait time after load (ms)
-  waitUntil?: "domcontentloaded" | "load" | "networkidle" | "networkidle0" | "networkidle2",
-  width?: number,               // Viewport width (default: 1920)
-  height?: number,              // Viewport height (default: 1080)
-  blockResources?: boolean,     // Block CSS/images/fonts (default: true)
-
-  // Headers & Cookies
-  customHeaders?: boolean,      // Handle all headers
-  extraHeaders?: boolean,       // Add extra headers
-  forwardHeaders?: boolean,     // Forward your headers
-  setCookies?: string,          // Set cookies ('name=value; name2=value2')
-  pureCookies?: boolean,        // Return original cookies
-
-  // Other
-  disableRedirection?: boolean, // Disable redirect
-  callback?: string             // Webhook URL for async results
-}
-```
-
-### google_search
-
-Search Google and get structured results.
+## Available Tools

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-// Advanced
-includeHtml?: boolean // Include raw HTML in response
-}
-```
+| Tool | Purpose |
+|------|---------|
+| `scrape_url` | Main Scrape.do scraping API wrapper |
+| `google_search` | Structured Google search results |
+| `amazon_product` | Amazon PDP structured data |
+| `amazon_offer_listing` | Amazon seller offers |
+| `amazon_search` | Amazon keyword/category results |
+| `amazon_raw_html` | Raw Amazon HTML with geo-targeting |
+| `async_create_job` | Create Async API jobs |
+| `async_get_job` | Fetch Async job details |
+| `async_get_task` | Fetch Async task details |
+| `async_list_jobs` | List Async jobs |
+| `async_cancel_job` | Cancel Async jobs |
+| `async_get_account` | Fetch Async account/concurrency info |
+| `proxy_mode_config` | Generate Proxy Mode configuration |

 ## Example Prompts

-
-
-### Scrape a Website
-```
-Please scrape https://github.com and give me the main content as markdown.
+```text
+Scrape https://example.com with render=true and wait for #app.
 ```

-
-
-Search Google for "best Python web frameworks 2026" and return the top 5 results.
+```text
+Search Google for "open source MCP servers" with google_domain=google.co.uk and lr=lang_en.
 ```

-
-
-Search for "AI news" in Chinese, from China, last week.
+```text
+Get the Amazon PDP for ASIN B0C7BKZ883 in the US with zipcode 10001.
 ```

-
-
-Scrape this React Single Page Application: https://example-spa.com
-Use render_js=true to get the fully rendered content.
+```text
+Create an async job for these 20 URLs and give me the job ID.
 ```

-### Get Raw HTML
-```
-Scrape https://example.com and return raw HTML instead of markdown.
-```
-
-### Geo-targeting
-```
-Scrape https://www.amazon.com/product/12345 as if I'm in Japan (geoCode: jp)
-```
-
-### Mobile Device
-```
-Scrape https://example.com using a mobile device to see the mobile version.
-```
-
-### Take Screenshot
-```
-Take a screenshot of https://example.com and return the image.
-```
-
-### Wait for Element
-```
-Scrape https://example.com but wait for the element with id "content" to load first.
-```
-
-### Session Persistence
-```
-Scrape multiple pages of https://example.com using sessionId 12345 to maintain the same IP.
-```
-
-## Comparison with Alternatives
-
-| Feature | scrape-do-mcp | Firecrawl | Browserbase |
-|---------|--------------|-----------|-------------|
-| Google Search | ✅ | ❌ | ❌ |
-| Free Credits | 1,000 | 500 | None |
-| Pricing | Pay per use | $19+/mo | $15+/mo |
-| MCP Native | ✅ | ✅ | ❌ |
-| Setup Required | None | API key | API key + browser |
-
-### Why scrape-do-mcp?
-
-- **Zero setup**: Just get a token and use immediately
-- **All-in-one**: Both web scraping AND Google search in one MCP
-- **Anti-bot bypass**: Automatically handles Cloudflare, WAFs, CAPTCHAs
-- **Cost-effective**: Pay only for what you use, free tier available
-
-## Credit Usage
-
-| Tool | Credit Cost |
-|------|-------------|
-| scrape_url (regular) | 1 credit/request |
-| scrape_url (super_proxy) | 10 credits/request |
-| google_search | 1 credit/request |
-
-**Free: 1,000 credits/month** - No credit card required: https://app.scrape.do
-
 ## Development

 ```bash
 npm install
 npm run build
-npm run dev
+npm run dev
 ```

 ## License
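The alias pairs the new README lists under Compatibility Notes amount to a small parameter-normalization table. The sketch below is illustrative only: it mirrors the mapping the README documents, not the package's actual code, and `normalizeScrapeParams` is a hypothetical name.

```typescript
// Map MCP-friendly aliases onto the official Scrape.do parameter names
// listed in the README's Compatibility Notes (hypothetical sketch).
const SCRAPE_URL_ALIASES: Record<string, string> = {
  render_js: "render",
  super_proxy: "super",
  screenshot: "screenShot",
};

function normalizeScrapeParams(
  input: Record<string, unknown>,
): Record<string, unknown> {
  const normalized: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    // Unknown keys pass through unchanged, so the official names keep working.
    normalized[SCRAPE_URL_ALIASES[key] ?? key] = value;
  }
  return normalized;
}

// render_js=true would be sent to the API as render=true.
normalizeScrapeParams({ url: "https://example.com", render_js: true });
```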