scrape-do-mcp 0.1.2 → 0.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README-ZH.md +146 -0
- package/README.md +69 -20
- package/dist/index.js +3 -46
- package/package.json +3 -2
package/README-ZH.md
ADDED
@@ -0,0 +1,146 @@
+# scrape-do-mcp
+
+Scrape.do 网页抓取和 Google 搜索 MCP 服务器 - 支持反机器人保护
+
+## 功能特点
+
+- **scrape_url**: 抓取任意网页并返回 Markdown 格式内容。自动绕过 Cloudflare、WAF、CAPTCHA 和反爬虫保护。支持 JavaScript 渲染页面。
+- **google_search**: 搜索 Google 并返回结构化的 SERP 结果 JSON。包含自然搜索结果、知识图谱、本地商家、新闻、相关问题(People Also Ask)等。
+
+## 安装
+
+### 快速安装(推荐)
+
+在终端中运行以下命令:
+
+```bash
+claude mcp add-json scrape-do --scope user '{
+  "type": "stdio",
+  "command": "npx",
+  "args": ["-y", "scrape-do-mcp"],
+  "env": {
+    "SCRAPE_DO_TOKEN": "你的Token"
+  }
+}'
+```
+
+将 `你的Token` 替换为你在 https://app.scrape.do 获取的 API Token。
+
+### Claude Desktop
+
+添加到 `~/.claude.json`:
+
+```json
+{
+  "mcpServers": {
+    "scrape-do": {
+      "command": "npx",
+      "args": ["-y", "scrape-do-mcp"],
+      "env": {
+        "SCRAPE_DO_TOKEN": "你的Token"
+      }
+    }
+  }
+}
+```
+
+获取免费 API Token:https://app.scrape.do
+
+## 使用方法
+
+### scrape_url
+
+抓取任意网页并获取 Markdown 内容。
+
+```typescript
+// 参数
+{
+  url: string,                 // 要抓取的网址
+  render_js?: boolean,         // 渲染 JavaScript(默认 false)
+  super_proxy?: boolean,       // 使用住宅代理(消耗 10 积分,默认 false)
+  output?: "markdown" | "raw"  // 输出格式(默认 markdown)
+}
+```
+
+### google_search
+
+搜索 Google 并获取结构化结果。
+
+```typescript
+// 参数
+{
+  query: string,     // 搜索关键词
+  country?: string,  // 国家代码(默认 "us")
+  language?: string, // 界面语言(默认 "en")
+  page?: number,     // 页码(默认 1)
+  time_period?: "" | "last_hour" | "last_day" | "last_week" | "last_month" | "last_year",
+  device?: "desktop" | "mobile" // 设备类型(默认 desktop)
+}
+```
+
+## 使用示例
+
+### 抓取网页
+```
+请抓取 https://github.com 并给我主要内容(Markdown 格式)。
+```
+
+### Google 搜索
+```
+搜索 "2026 年最佳 Python Web 框架",返回前 5 个结果。
+```
+
+### 带筛选条件的搜索
+```
+用中文搜索 "AI 新闻",限定为中国,过去一周的内容。
+```
+
+### JavaScript 渲染
+```
+抓取这个 React 单页应用:https://example-spa.com
+使用 render_js=true 获取完整渲染内容。
+```
+
+### 获取原始 HTML
+```
+抓取 https://example.com 并返回原始 HTML 而不是 markdown。
+```
+
+## 与其他工具对比
+
+| 功能 | scrape-do-mcp | Firecrawl | Browserbase |
+|------|--------------|-----------|-------------|
+| Google 搜索 | ✅ | ❌ | ❌ |
+| 免费积分 | 1,000 | 500 | 无 |
+| 价格 | 按量付费 | $19+/月 | $15+/月 |
+| MCP 原生 | ✅ | ✅ | ❌ |
+| 配置难度 | 无需配置 | 需要 API key | 需要 API key + 浏览器 |
+
+### 为什么选择 scrape-do-mcp?
+
+- **零配置**:获取 Token 后即可立即使用
+- **一体化**:网页抓取和 Google 搜索集于一个 MCP
+- **反爬虫绕过**:自动处理 Cloudflare、WAF、CAPTCHA
+- **成本效益**:按需付费,免费额度可用
+
+## 积分消耗
+
+| 工具 | 积分消耗 |
+|------|---------|
+| scrape_url(普通) | 1 积分/次 |
+| scrape_url(super_proxy) | 10 积分/次 |
+| google_search | 1 积分/次 |
+
+注册即送 **1,000 积分**:https://app.scrape.do
+
+## 开发
+
+```bash
+npm install
+npm run build
+npm run dev # 开发模式运行
+```
+
+## 许可证
+
+MIT
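Both the new Chinese README and the English README below document the same `scrape_url` options. As a minimal sketch, here is how those options could map onto a Scrape.do request URL; the query-parameter names `render`, `super`, and `output` are assumptions inferred from the README, not confirmed fields of the Scrape.do API:

```typescript
// Sketch: assemble a Scrape.do request URL from the scrape_url tool options.
// Parameter names "render", "super", "output" are assumptions, not confirmed API fields.
const SCRAPE_API_BASE = "https://api.scrape.do";

interface ScrapeOptions {
  url: string;                  // page to scrape
  render_js?: boolean;          // render JavaScript (default false)
  super_proxy?: boolean;        // residential proxy, 10 credits (default false)
  output?: "markdown" | "raw";  // default "markdown"
}

function buildScrapeUrl(token: string, opts: ScrapeOptions): string {
  const params = new URLSearchParams({ token, url: opts.url });
  if (opts.render_js) params.set("render", "true");
  if (opts.super_proxy) params.set("super", "true");
  if ((opts.output ?? "markdown") === "markdown") params.set("output", "markdown");
  return `${SCRAPE_API_BASE}/?${params.toString()}`;
}

console.log(buildScrapeUrl("YOUR_TOKEN_HERE", { url: "https://example.com", render_js: true }));
```

The actual package builds its requests internally (via axios, per `dist/index.js`); this standalone helper only illustrates the documented option surface.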
package/README.md
CHANGED
@@ -9,7 +9,24 @@ MCP Server for Scrape.do - Web Scraping & Google Search with anti-bot bypass
 
 ## Installation
 
-###
+### Quick Install (Recommended)
+
+Run this command in your terminal:
+
+```bash
+claude mcp add-json scrape-do --scope user '{
+  "type": "stdio",
+  "command": "npx",
+  "args": ["-y", "scrape-do-mcp"],
+  "env": {
+    "SCRAPE_DO_TOKEN": "YOUR_TOKEN_HERE"
+  }
+}'
+```
+
+Replace `YOUR_TOKEN_HERE` with your Scrape.do API token from https://app.scrape.do
+
+### Claude Desktop
 
 Add to your `~/.claude.json`:
 
@@ -29,29 +46,12 @@ Add to your `~/.claude.json`:
 
 Get your free API token at: https://app.scrape.do
 
-### Smithery.ai
-
-Published on [Smithery.ai](https://smithery.ai) - Search for "scrape-do" to install.
-
-### HTTP Server Mode
-
-The server supports both STDIO and HTTP modes:
-
-- **STDIO mode** (default): For local Claude Code / Claude Desktop usage
-- **HTTP mode**: For Smithery hosting or custom HTTP deployment
-
-```bash
-# HTTP mode
-TRANSPORT=http PORT=3000 SCRAPE_DO_TOKEN=your_token npm start
-
-# Health check
-curl http://localhost:3000/health
-```
-
 ## Usage
 
 ### scrape_url
 
+Scrape any webpage and get content as Markdown.
+
 ```typescript
 // Parameters
 {
@@ -64,6 +64,8 @@ curl http://localhost:3000/health
 
 ### google_search
 
+Search Google and get structured results.
+
 ```typescript
 // Parameters
 {
@@ -76,6 +78,53 @@ curl http://localhost:3000/health
 }
 ```
 
+## Example Prompts
+
+Here are some prompts you can use to invoke the tools:
+
+### Scrape a Website
+```
+Please scrape https://github.com and give me the main content as markdown.
+```
+
+### Search Google
+```
+Search Google for "best Python web frameworks 2026" and return the top 5 results.
+```
+
+### Search with Filters
+```
+Search for "AI news" in Chinese, from China, last week.
+```
+
+### JavaScript Rendering
+```
+Scrape this React Single Page Application: https://example-spa.com
+Use render_js=true to get the fully rendered content.
+```
+
+### Get Raw HTML
+```
+Scrape https://example.com and return raw HTML instead of markdown.
+```
+
+## Comparison with Alternatives
+
+| Feature | scrape-do-mcp | Firecrawl | Browserbase |
+|---------|--------------|-----------|-------------|
+| Google Search | ✅ | ❌ | ❌ |
+| Free Credits | 1,000 | 500 | None |
+| Pricing | Pay per use | $19+/mo | $15+/mo |
+| MCP Native | ✅ | ✅ | ❌ |
+| Setup Required | None | API key | API key + browser |
+
+### Why scrape-do-mcp?
+
+- **Zero setup**: Just get a token and use immediately
+- **All-in-one**: Both web scraping AND Google search in one MCP
+- **Anti-bot bypass**: Automatically handles Cloudflare, WAFs, CAPTCHAs
+- **Cost-effective**: Pay only for what you use, free tier available
+
 ## Credit Usage
 
 | Tool | Credit Cost |
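The Credit Usage table in the README implies a simple cost model: 1 credit per plain scrape or search, 10 per `super_proxy` scrape. A toy estimator for budgeting a batch of tool calls (illustrative only; this helper is not part of the package):

```typescript
// Toy credit estimator based on the README's credit table:
// scrape_url = 1 credit, scrape_url with super_proxy = 10, google_search = 1.
type ToolCall =
  | { tool: "scrape_url"; super_proxy?: boolean }
  | { tool: "google_search" };

function estimateCredits(calls: ToolCall[]): number {
  return calls.reduce((total, call) => {
    if (call.tool === "scrape_url") return total + (call.super_proxy ? 10 : 1);
    return total + 1; // google_search
  }, 0);
}

// 2 plain scrapes + 1 residential-proxy scrape + 3 searches = 2 + 10 + 3 = 15 credits
console.log(estimateCredits([
  { tool: "scrape_url" },
  { tool: "scrape_url" },
  { tool: "scrape_url", super_proxy: true },
  { tool: "google_search" },
  { tool: "google_search" },
  { tool: "google_search" },
]));
```

At these rates the 1,000 free sign-up credits cover 1,000 plain scrapes or searches, but only 100 `super_proxy` scrapes.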
package/dist/index.js
CHANGED
@@ -6,16 +6,13 @@ var __importDefault = (this && this.__importDefault) || function (mod) {
 Object.defineProperty(exports, "__esModule", { value: true });
 const mcp_js_1 = require("@modelcontextprotocol/sdk/server/mcp.js");
 const stdio_js_1 = require("@modelcontextprotocol/sdk/server/stdio.js");
-const streamableHttp_js_1 = require("@modelcontextprotocol/sdk/server/streamableHttp.js");
 const zod_1 = require("zod");
 const axios_1 = __importDefault(require("axios"));
-const http_1 = __importDefault(require("http"));
 const SCRAPE_DO_TOKEN = process.env.SCRAPE_DO_TOKEN || "";
 const SCRAPE_API_BASE = "https://api.scrape.do";
-const HTTP_PORT = process.env.PORT || process.env.HTTP_PORT || 3000;
 const server = new mcp_js_1.McpServer({
     name: "scrape-do-mcp",
-    version: "0.1.
+    version: "0.1.3",
 });
 // ─── Tool 1: scrape_url ──────────────────────────────────────────────────────
 server.tool("scrape_url", "Scrape any webpage and return its content as Markdown. Automatically bypasses Cloudflare, WAFs, CAPTCHAs, and anti-bot protection. Supports JavaScript-rendered pages.", {
 
@@ -97,47 +94,7 @@ server.tool("google_search", "Search Google and return structured SERP results a
 });
 // ─── Start Server ────────────────────────────────────────────────────────────
 async function main() {
-    const
-
-        console.error(`Starting Streamable HTTP server on port ${HTTP_PORT}...`);
-        const transport = new streamableHttp_js_1.StreamableHTTPServerTransport({
-            sessionIdGenerator: () => Math.random().toString(36).substring(2, 15),
-        });
-        await server.connect(transport);
-        const serverInstance = http_1.default.createServer();
-        serverInstance.on("request", async (req, res) => {
-            // Handle CORS
-            res.setHeader("Access-Control-Allow-Origin", "*");
-            res.setHeader("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
-            res.setHeader("Access-Control-Allow-Headers", "Content-Type");
-            if (req.method === "OPTIONS") {
-                res.writeHead(204);
-                res.end();
-                return;
-            }
-            // Health check
-            if (req.url === "/health") {
-                res.writeHead(200, { "Content-Type": "application/json" });
-                res.end(JSON.stringify({ status: "ok", name: "scrape-do-mcp", version: "0.1.1" }));
-                return;
-            }
-            // MCP endpoint
-            if (req.url === "/" || req.url?.startsWith("/mcp")) {
-                await transport.handleRequest(req, res);
-                return;
-            }
-            res.writeHead(404);
-            res.end("Not found");
-        });
-        serverInstance.listen(parseInt(String(HTTP_PORT), 10), () => {
-            console.error(`MCP server running on http://localhost:${HTTP_PORT}`);
-        });
-    }
-    else {
-        // Default to stdio mode
-        console.error("Starting STDIO server...");
-        const transport = new stdio_js_1.StdioServerTransport();
-        await server.connect(transport);
-    }
+    const transport = new stdio_js_1.StdioServerTransport();
+    await server.connect(transport);
 }
 main().catch(console.error);
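This build drops the HTTP transport branch entirely and always starts in stdio mode. For reference, the removed code appears to have keyed the choice off a `TRANSPORT=http` environment variable (documented in the deleted "HTTP Server Mode" README section) with a `PORT`/`HTTP_PORT` fallback (the deleted `HTTP_PORT` constant); the exact condition is truncated to `const` in this diff, so the following is a hypothetical reconstruction of just that selection logic:

```typescript
// Hypothetical reconstruction of the transport selection removed in 0.1.4.
// TRANSPORT=http and the PORT || HTTP_PORT || 3000 fallback come from the
// deleted README section and the deleted HTTP_PORT constant; the actual
// removed condition is truncated in the published diff.
type TransportChoice = { mode: "http"; port: number } | { mode: "stdio" };

function selectTransport(env: Record<string, string | undefined>): TransportChoice {
  if (env.TRANSPORT === "http") {
    // Mirrors: const HTTP_PORT = process.env.PORT || process.env.HTTP_PORT || 3000;
    const port = parseInt(String(env.PORT || env.HTTP_PORT || 3000), 10);
    return { mode: "http", port };
  }
  return { mode: "stdio" }; // the default, and the only mode kept in 0.1.4
}

console.log(selectTransport({ TRANSPORT: "http", PORT: "8080" })); // http transport
console.log(selectTransport({}));                                  // stdio transport
```

Removing the branch also drops the `streamableHttp.js` and `http` imports, which accounts for most of the 46 deleted lines in this file.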
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "scrape-do-mcp",
-  "version": "0.1.
+  "version": "0.1.4",
   "description": "MCP Server for Scrape.do - Web Scraping & Google Search with anti-bot bypass",
   "main": "dist/index.js",
   "bin": {
 
@@ -25,7 +25,8 @@
   },
   "files": [
     "dist",
-    "README.md"
+    "README.md",
+    "README-ZH.md"
   ],
   "engines": {
     "node": ">=18.0.0"