mcp-web-reader 2.0.0
- package/README.md +255 -0
- package/dist/index.d.ts +2 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +598 -0
- package/package.json +45 -0
package/README.md
ADDED
@@ -0,0 +1,255 @@
# MCP Web Reader

A powerful MCP (Model Context Protocol) server that lets Claude and other large language models read and parse web content. It can work around access restrictions to fetch protected content such as WeChat articles and Time magazine.

## Features

- 🚀 **Three engines**: integrates the Jina Reader API, a local parser, and a Playwright browser
- 🔄 **Smart fallback**: automatic three-tier switching, Jina Reader → local parser → Playwright browser
- 🌐 **Works around restrictions**: uses Playwright to handle access restrictions such as Cloudflare and CAPTCHAs
- 📦 **Batch processing**: fetches multiple URLs at once
- 🎯 **Flexible control**: optionally force a specific parsing method
- 📝 **Markdown output**: automatically converts content to clean Markdown

## Installation

### Method 1: Install from source

```bash
# Clone the repository
git clone https://github.com/zacfire/mcp-web-reader.git
cd mcp-web-reader

# Install dependencies
npm install

# Build the project
npm run build

# Install the Playwright browser (required)
npx playwright install chromium
```

### Method 2: Install with npm (recommended)

Once the package is published, you can install it directly with npm:

```bash
npm install -g mcp-web-reader
```

**First-time publishing steps**:
If this is the first release, run the provided publish script:

```bash
# Make sure you are logged in to npm
npm login

# Run the publish script
./publish.sh
```

## Configuration

### Quick setup

Add the following to the Claude Desktop configuration file:

**Windows**: `%APPDATA%\Claude\claude_desktop_config.json`

**macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "web-reader": {
      "command": "node",
      "args": ["/absolute/path/to/mcp-web-reader/dist/index.js"]
    }
  }
}
```

**Important**: Replace `/absolute/path/to/mcp-web-reader/dist/index.js` with your actual path.

### Detailed configuration guide

📖 **Full usage guide**: see [USAGE_GUIDE.md](./USAGE_GUIDE.md), which covers:
- **Command-line usage (recommended)** - for users who work from the CLI
- Claude Desktop configuration
- Claude Code (Cursor) configuration
- Configuration for other MCP clients
- Usage examples and troubleshooting

📖 **Command-line guide**: see [CLI_USAGE.md](./CLI_USAGE.md), which covers:
- Testing with MCP Inspector
- Creating a CLI wrapper
- Integrating into custom scripts
- Command-line usage examples

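The `mcpServers` entry from the quick setup can also be generated instead of typed by hand. This is a minimal sketch, assuming a hypothetical helper named `buildClaudeConfig` that is not part of this package; it only reproduces the JSON shape shown in the quick setup for whatever absolute path you pass in.

```javascript
// Hypothetical helper (not part of mcp-web-reader): builds the
// `mcpServers` entry from the quick setup for a given absolute path.
function buildClaudeConfig(indexJsPath) {
  if (typeof indexJsPath !== "string" || indexJsPath.length === 0) {
    throw new TypeError("indexJsPath must be a non-empty string");
  }
  return {
    mcpServers: {
      "web-reader": {
        command: "node",
        args: [indexJsPath],
      },
    },
  };
}

// Print JSON that can be pasted into claude_desktop_config.json.
console.log(
  JSON.stringify(
    buildClaudeConfig("/absolute/path/to/mcp-web-reader/dist/index.js"),
    null,
    2
  )
);
```

If the configuration file already lists other servers, merge the `web-reader` key into the existing `mcpServers` object rather than overwriting the file.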
## Usage

### Command-line usage (recommended)

If you mainly use Claude from the command line, you can use the provided CLI tool:

```bash
# Smart fetch (automatic fallback)
node cli.js fetch https://example.com

# Force Jina Reader
node cli.js jina https://example.com

# Force the local parser
node cli.js local https://example.com

# Force browser mode (for restricted sites such as WeChat articles)
node cli.js browser https://mp.weixin.qq.com/...
```

Or use MCP Inspector for interactive testing:

```bash
npx @modelcontextprotocol/inspector node dist/index.js
```

📖 See [CLI_USAGE.md](./CLI_USAGE.md) for details.

### Using it in Claude

Once configured, you can use prompts like the following in Claude:

1. **Smart fetch (recommended)**
   * "Please fetch the content of [https://example.com](https://example.com)"
   * Automatic three-tier fallback: Jina Reader → local parser → Playwright browser

2. **Batch fetch**
   * "Please fetch these pages: [url1, url2, url3]"
   * Each URL gets the smart fallback strategy

3. **Force Jina Reader**
   * "Use Jina Reader to fetch [https://example.com](https://example.com)"

4. **Force the local parser**
   * "Use the local parser to fetch [https://example.com](https://example.com)"

5. **Force browser mode**
   * "Use the browser to fetch [https://example.com](https://example.com)"
   * Skips the other methods entirely; intended for sites known to have access restrictions

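## Supported restricted site types

✅ **WeChat Official Account articles** - automatically works around access restrictions
✅ **Time magazine, The New York Times** - gets past paywalls and regional restrictions
✅ **Cloudflare-protected sites** - avoids detection by using a real browser
✅ **Pages that require JavaScript rendering** - fully executes page scripts
✅ **Sites with CAPTCHAs or human verification** - simulates real user behavior
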
## Tools

- `fetch_url` - smart fetch (three-tier fallback: Jina → local → Playwright)
- `fetch_url_with_jina` - force Jina Reader
- `fetch_url_local` - force the local parser
- `fetch_url_with_browser` - force the Playwright browser (works around access restrictions)
- `fetch_multiple_urls` - fetch multiple URLs in one batch

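One way to picture how the five tool names map to fetch strategies is a lookup table. The sketch below is illustrative only: the strategy labels and the `strategyFor` function are hypothetical, and the published `dist/index.js` dispatches with an if/else chain rather than a table.

```javascript
// Illustrative only: maps each tool name from the list above to the
// strategy it selects. The string labels are hypothetical.
const TOOL_STRATEGIES = {
  fetch_url: "smart-fallback",
  fetch_multiple_urls: "smart-fallback-per-url",
  fetch_url_with_jina: "jina-only",
  fetch_url_local: "local-only",
  fetch_url_with_browser: "browser-only",
};

function strategyFor(toolName) {
  // Own-property check so inherited keys such as "toString" are rejected.
  if (!Object.prototype.hasOwnProperty.call(TOOL_STRATEGIES, toolName)) {
    throw new Error(`Unknown tool: ${toolName}`);
  }
  return TOOL_STRATEGIES[toolName];
}

console.log(strategyFor("fetch_url_local"));
```

A table like this keeps the "unknown tool" error path in one place, which is the same role the final `else` branch plays in the published handler.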
## Technical architecture

### Smart fallback strategy
```
User requests a URL
    ↓
1. Jina Reader API (fastest, high success rate)
    ↓ on failure
2. Local parser (Node.js + JSDOM)
    ↓ access restriction detected
3. Playwright browser (real browser, works around restrictions)
```

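The three-tier strategy can be sketched as a generic fallback loop. This is a simplified illustration under stated assumptions, not the package's implementation: `fetchWithFallback`, the tier objects, and the `shouldContinue` hook are hypothetical names, and the stub tiers stand in for Jina Reader, the local parser, and Playwright.

```javascript
// Generic fallback runner. `tiers` is an ordered list of
// { name, run, shouldContinue } objects; all names here are hypothetical.
async function fetchWithFallback(url, tiers) {
  const errors = [];
  for (const tier of tiers) {
    try {
      return { method: tier.name, content: await tier.run(url) };
    } catch (err) {
      errors.push(`${tier.name}: ${err.message}`);
      // A tier may veto further fallback (e.g. no access restriction seen).
      if (tier.shouldContinue && !tier.shouldContinue(err)) break;
    }
  }
  throw new Error(`All methods failed. ${errors.join("; ")}`);
}

// Usage with stub tiers standing in for Jina, the local parser, and Playwright.
const tiers = [
  { name: "jina", run: async () => { throw new Error("status: 503"); } },
  {
    name: "local",
    run: async () => { throw new Error("status: 403"); },
    // Only escalate to the browser when the failure looks like a restriction.
    shouldContinue: (err) => /403|429|503/.test(err.message),
  },
  { name: "browser", run: async (url) => `rendered ${url}` },
];

fetchWithFallback("https://example.com", tiers)
  .then((result) => console.log(result.method)); // prints "browser"
```

Collecting every tier's error message before giving up mirrors the combined error the server reports when all methods fail.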
### Access restriction detection
Browser mode is enabled automatically when any of the following is detected:
- HTTP status codes: 403, 429, 503, 520-524
- Error keywords: Cloudflare, CAPTCHA, Access Denied, Rate Limit
- Content keywords: Security Check, Human Verification

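The detection rules listed above amount to three checks in sequence. The following is a minimal sketch, assuming a hypothetical `looksRestricted` function and constant names; the keyword lists are copied from this README section, and the published code uses longer lists.

```javascript
// Sketch of the detection rules listed above. Function and constant
// names are hypothetical; keyword lists are taken from the README.
const RESTRICTED_STATUS = new Set([403, 429, 503, 520, 521, 522, 523, 524]);
const ERROR_KEYWORDS = ["cloudflare", "captcha", "access denied", "rate limit"];
const CONTENT_KEYWORDS = ["security check", "human verification"];

function looksRestricted({ status, errorMessage = "", body = "" }) {
  // 1. Status codes that usually indicate blocking or throttling.
  if (RESTRICTED_STATUS.has(status)) return true;
  // 2. Keywords in the error message (case-insensitive).
  const message = errorMessage.toLowerCase();
  if (ERROR_KEYWORDS.some((k) => message.includes(k))) return true;
  // 3. Keywords in the response body (case-insensitive).
  const text = body.toLowerCase();
  return CONTENT_KEYWORDS.some((k) => text.includes(k));
}

console.log(looksRestricted({ status: 429 }));                         // true
console.log(looksRestricted({ status: 200, body: "<h1>Hello</h1>" })); // false
```

Substring matching is cheap but coarse: an ordinary article that merely mentions "security check" would also match, so a result of `true` is best read as a hint to try the browser, not as proof of blocking.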
## Development

```bash
# Development mode (recompiles automatically)
npm run dev

# Build
npm run build

# Test run
npm start

# Install browser binaries (required before first use)
npx playwright install chromium
```

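## Performance optimizations

- ⚡ **Browser instance reuse** - avoids repeated startup overhead
- 🚫 **Resource filtering** - blocks unnecessary loads such as images and stylesheets
- 🎯 **Smart selection** - prefers the fast methods and uses the browser only when needed
- 💾 **Graceful shutdown** - cleans up browser resources properly
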
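Browser instance reuse is a lazy-singleton pattern. The sketch below shows one variant under stated assumptions: `createReusable` is a hypothetical helper, and the fake `launch` function stands in for a real browser launch. It caches the launch promise, whereas the published `getBrowser` caches the resolved browser, so in the published version two concurrent first calls could each start a browser.

```javascript
// Lazy singleton: `launch` runs at most once until `close` is called.
// `launch` is a stand-in for a real browser launch function.
function createReusable(launch) {
  let instancePromise = null;
  return {
    get() {
      // Cache the promise (not the resolved value) so concurrent callers
      // share one launch instead of racing to start several.
      if (!instancePromise) {
        instancePromise = launch().catch((err) => {
          instancePromise = null; // allow a retry after a failed launch
          throw err;
        });
      }
      return instancePromise;
    },
    async close() {
      if (!instancePromise) return;
      const instance = await instancePromise;
      instancePromise = null;
      await instance.close();
    },
  };
}

// Usage with a fake browser object.
let launches = 0;
const reusable = createReusable(async () => {
  launches += 1;
  return { close: async () => {} };
});

Promise.all([reusable.get(), reusable.get()])
  .then(() => console.log(`launches: ${launches}`)); // prints "launches: 1"
```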
## Verifying the installation

### Testing the MCP server

1. **Test with MCP Inspector**:
   ```bash
   npx @modelcontextprotocol/inspector node dist/index.js
   ```

2. **Test the tools**:
   In Inspector, enter the following JSON to test each tool:
   ```json
   {"method": "tools/call", "params": {"name": "fetch_url", "arguments": {"url": "https://example.com"}}}
   ```

### Verifying in Claude Desktop

After configuring, restart Claude Desktop and enter the following in a conversation:
- "Please fetch the content of https://httpbin.org/json"

If content is returned successfully, the installation works.

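The test payload above can be built programmatically for any of the tools. This is a minimal sketch, assuming a hypothetical `buildToolCall` helper that is not part of the package. The README's snippet omits the JSON-RPC 2.0 envelope fields `jsonrpc` and `id`; the sketch includes them because a raw JSON-RPC client would normally send them. The `preferJina` and `urls` arguments come from the tool schemas in `dist/index.js`.

```javascript
// Builds a JSON-RPC 2.0 `tools/call` request for this server's tools.
// `buildToolCall` is a hypothetical helper, not part of the package.
let nextId = 1;
function buildToolCall(name, args) {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Single URL, skipping Jina Reader and starting from the local parser.
const single = buildToolCall("fetch_url", {
  url: "https://example.com",
  preferJina: false,
});

// Batch request; the tool schema caps `urls` at 10 items.
const batch = buildToolCall("fetch_multiple_urls", {
  urls: ["https://example.com", "https://httpbin.org/json"],
});

console.log(JSON.stringify(single));
console.log(JSON.stringify(batch));
```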
## Troubleshooting

### Common issues

1. **"Module not found" errors**
   - Make sure you have run `npm install`
   - Make sure you have run `npm run build`

2. **Claude Desktop cannot connect to the MCP server**
   - Check that the configuration file path is correct
   - Check that the `dist/index.js` path is correct
   - Restart Claude Desktop

3. **Playwright browser errors**
   - Make sure you have run `npx playwright install chromium`
   - Check whether the system supports a graphical interface (some server environments may need extra configuration)

4. **WeChat articles cannot be fetched**
   - WeChat articles require Playwright browser mode
   - Use the `fetch_url_with_browser` tool to force the browser

### Debug mode

Enable verbose logging:
```bash
DEBUG=* node dist/index.js
```

## Contributing

Pull requests are welcome!

## License

MIT License

package/dist/index.d.ts
ADDED
@@ -0,0 +1 @@
{"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":""}
package/dist/index.js
ADDED
@@ -0,0 +1,598 @@
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { CallToolRequestSchema, ErrorCode, ListToolsRequestSchema, McpError, } from "@modelcontextprotocol/sdk/types.js";
import fetch from "node-fetch";
import { JSDOM } from "jsdom";
import TurndownService from "turndown";
import { chromium } from "playwright";
// Create the server instance
const server = new Server({
    name: "web-reader",
    version: "2.0.0",
}, {
    capabilities: {
        tools: {},
    },
});
// Initialize the Turndown service (converts HTML to Markdown)
const turndownService = new TurndownService({
    headingStyle: "atx",
    codeBlockStyle: "fenced",
});
// Configure Turndown rules
turndownService.addRule("skipScripts", {
    filter: ["script", "style", "noscript"],
    replacement: () => "",
});
// Browser instance management
let browser = null;
// Get or create the browser instance
async function getBrowser() {
    if (!browser) {
        browser = await chromium.launch({
            headless: true,
            args: [
                '--no-sandbox',
                '--disable-dev-shm-usage',
                '--disable-blink-features=AutomationControlled', // disable automation detection
                '--disable-infobars',
                '--window-size=1920,1080',
                '--start-maximized',
            ],
        });
    }
    return browser;
}
// Clean up the browser instance
async function closeBrowser() {
    if (browser) {
        await browser.close();
        browser = null;
    }
}
// URL validation
function isValidUrl(urlString) {
    try {
        const url = new URL(urlString);
        return url.protocol === "http:" || url.protocol === "https:";
    }
    catch {
        return false;
    }
}
// Detect whether the URL is a WeChat article link
function isWeixinUrl(url) {
    return url.includes('mp.weixin.qq.com') || url.includes('weixin.qq.com');
}
// Decide whether browser mode is needed
// Note: every caller in this file passes only `error`, so `statusCode` and `content` are undefined here
function shouldUseBrowser(error, statusCode, content) {
    const errorMessage = error.message.toLowerCase();
    // Check the HTTP status code
    if (statusCode && [403, 429, 503, 520, 521, 522, 523, 524].includes(statusCode)) {
        return true;
    }
    // Check the error message
    const browserTriggers = [
        'cloudflare',
        'access denied',
        'forbidden',
        'captcha',
        'rate limit',
        'robot',
        'security',
        'blocked',
        'protection',
        'verification required',
        '环境异常', // "abnormal environment"
        '验证' // "verification"
    ];
    if (browserTriggers.some(trigger => errorMessage.includes(trigger))) {
        return true;
    }
    // Check the response content
    if (content) {
        const contentLower = content.toLowerCase();
        const contentTriggers = [
            'cloudflare',
            'ray id',
            'access denied',
            'security check',
            'human verification',
            'captcha',
            // WeChat-specific verification keywords
            '环境异常', // "abnormal environment"
            '去验证', // "go to verification"
            '完成验证后即可继续访问', // "complete verification to continue"
            'verify'
        ];
        if (contentTriggers.some(trigger => contentLower.includes(trigger))) {
            return true;
        }
    }
    return false;
}
// Fetch content with Jina Reader
async function fetchWithJinaReader(url) {
    try {
        // Jina Reader API URL
        const jinaUrl = `https://r.jina.ai/${url}`;
        // Create a timeout controller
        const controller = new AbortController();
        const timeoutId = setTimeout(() => controller.abort(), 30000);
        const response = await fetch(jinaUrl, {
            headers: {
                "Accept": "text/markdown",
                "User-Agent": "MCP-URLFetcher/2.0",
            },
            signal: controller.signal,
        });
        clearTimeout(timeoutId);
        if (!response.ok) {
            throw new Error(`Jina Reader API error! status: ${response.status}`);
        }
        const markdown = await response.text();
        // Extract the title from the Markdown (usually the first # heading)
        const titleMatch = markdown.match(/^#\s+(.+)$/m);
        const title = titleMatch ? titleMatch[1] : "无标题"; // fallback title: "Untitled"
        return {
            title,
            content: markdown,
            metadata: {
                url,
                fetchedAt: new Date().toISOString(),
                contentLength: markdown.length,
                method: "jina-reader",
            },
        };
    }
    catch (error) {
        if (error instanceof Error) {
            if (error.name === 'AbortError') {
                throw new Error(`Jina Reader请求超时(30秒)`);
            }
            throw new Error(`Jina Reader获取失败: ${error.message}`);
        }
        throw new Error(`Jina Reader获取失败: ${String(error)}`);
    }
}
// Fetch page content with Playwright
async function fetchWithPlaywright(url) {
    let page = null;
    const isWeixin = isWeixinUrl(url);
    try {
        const browserInstance = await getBrowser();
        page = await browserInstance.newPage();
        // Set a realistic User-Agent (Chrome on Mac)
        const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';
        await page.setExtraHTTPHeaders({
            'User-Agent': userAgent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Cache-Control': 'no-cache',
            'Pragma': 'no-cache',
            ...(isWeixin ? { 'Referer': 'https://mp.weixin.qq.com/' } : {}),
        });
        await page.setViewportSize({ width: 1920, height: 1080 });
        // WeChat articles need stylesheets to render correctly; other sites can have them filtered out
        if (!isWeixin) {
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
                    route.abort();
                }
                else {
                    route.continue();
                }
            });
        }
        // Navigate to the page with a longer timeout
        await page.goto(url, {
            timeout: 45000,
            waitUntil: 'networkidle' // wait for the network to go idle so JS has finished running
        });
        // WeChat articles need a longer wait
        const waitTime = isWeixin ? 5000 : 2000;
        await page.waitForTimeout(waitTime);
        // Get the page title
        const title = await page.title() || "无标题"; // fallback title: "Untitled"
        // Remove unwanted elements
        await page.evaluate(() => {
            const elementsToRemove = document.querySelectorAll('script, style, nav, header, footer, aside, .advertisement, .ads, .sidebar, .comments, .social-share');
            elementsToRemove.forEach(el => el.remove());
        });
        // Get the main content (WeChat articles have a specific DOM structure)
        const htmlContent = await page.evaluate(() => {
            // WeChat-specific selectors
            const weixinContent = document.querySelector('#js_content') ||
                document.querySelector('.rich_media_content');
            if (weixinContent) {
                return weixinContent.innerHTML;
            }
            // Generic selectors
            const mainContent = document.querySelector('main') ||
                document.querySelector('article') ||
                document.querySelector('[role="main"]') ||
                document.querySelector('.content') ||
                document.querySelector('#content') ||
                document.querySelector('.post') ||
                document.querySelector('.entry-content') ||
                document.body;
            return mainContent ? mainContent.innerHTML : document.body.innerHTML;
        });
        // Convert to Markdown
        const markdown = turndownService.turndown(htmlContent);
        // Clean up the content
        const cleanedContent = markdown
            .replace(/\n{3,}/g, "\n\n")
            .replace(/^\s+$/gm, "")
            .trim();
        return {
            title,
            content: cleanedContent,
            metadata: {
                url,
                fetchedAt: new Date().toISOString(),
                contentLength: cleanedContent.length,
                method: "playwright-browser",
            },
        };
    }
    catch (error) {
        if (error instanceof Error) {
            throw new Error(`Playwright获取失败: ${error.message}`);
        }
        throw new Error(`Playwright获取失败: ${String(error)}`);
    }
    finally {
        if (page) {
            await page.close();
        }
    }
}
// Extract page content locally
async function fetchWithLocalParser(url) {
    try {
        // Create a timeout controller
        const controller = new AbortController();
        const timeoutId = setTimeout(() => controller.abort(), 30000);
        // Send the HTTP request
        const response = await fetch(url, {
            headers: {
                "User-Agent": "Mozilla/5.0 (compatible; MCP-URLFetcher/2.0)",
            },
            signal: controller.signal,
        });
        clearTimeout(timeoutId);
        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }
        // Read the HTML content
        const html = await response.text();
        // Parse the HTML with JSDOM
        const dom = new JSDOM(html);
        const document = dom.window.document;
        // Get the title
        const title = document.querySelector("title")?.textContent || "无标题"; // fallback title: "Untitled"
        // Remove unwanted elements
        const elementsToRemove = document.querySelectorAll("script, style, nav, header, footer, aside, .advertisement, .ads, .sidebar, .comments");
        elementsToRemove.forEach(el => el.remove());
        // Find the main content area
        const mainContent = document.querySelector("main") ||
            document.querySelector("article") ||
            document.querySelector('[role="main"]') ||
            document.querySelector(".content") ||
            document.querySelector("#content") ||
            document.querySelector(".post") ||
            document.querySelector(".entry-content") ||
            document.body;
        // Convert to Markdown
        const markdown = turndownService.turndown(mainContent.innerHTML);
        // Collapse extra blank lines and whitespace
        const cleanedContent = markdown
            .replace(/\n{3,}/g, "\n\n")
            .replace(/^\s+$/gm, "")
            .trim();
        return {
            title,
            content: cleanedContent,
            metadata: {
                url,
                fetchedAt: new Date().toISOString(),
                contentLength: cleanedContent.length,
                method: "local-parser",
            },
        };
    }
    catch (error) {
        if (error instanceof Error) {
            if (error.name === 'AbortError') {
                throw new Error(`本地解析请求超时(30秒)`);
            }
            throw new Error(`本地解析失败: ${error.message}`);
        }
        throw new Error(`本地解析失败: ${String(error)}`);
    }
}
// Smart fetch (three-tier fallback: Jina → local → Playwright)
// Sites known to need a browser, such as WeChat, go straight to browser mode
async function fetchWebContent(url, preferJina = true) {
    // WeChat articles use browser mode directly because the other methods cannot get past verification
    if (isWeixinUrl(url)) {
        console.error("检测到微信文章,直接使用Playwright浏览器模式");
        return await fetchWithPlaywright(url);
    }
    if (preferJina) {
        // Tier 1: try Jina Reader
        try {
            return await fetchWithJinaReader(url);
        }
        catch (jinaError) {
            console.error("Jina Reader失败,尝试本地解析:", jinaError instanceof Error ? jinaError.message : String(jinaError));
            // Tier 2: try the local parser
            try {
                return await fetchWithLocalParser(url);
            }
            catch (localError) {
                console.error("本地解析失败,检查是否需要浏览器模式:", localError instanceof Error ? localError.message : String(localError));
                // Decide whether browser mode is needed
                const jinaErr = jinaError instanceof Error ? jinaError : new Error(String(jinaError));
                const localErr = localError instanceof Error ? localError : new Error(String(localError));
                if (shouldUseBrowser(jinaErr) || shouldUseBrowser(localErr)) {
                    console.error("检测到访问限制,使用Playwright浏览器模式");
                    try {
                        // Tier 3: use the Playwright browser
                        return await fetchWithPlaywright(url);
                    }
                    catch (browserError) {
                        throw new Error(`所有方法都失败了。Jina: ${jinaErr.message}, 本地: ${localErr.message}, 浏览器: ${browserError instanceof Error ? browserError.message : String(browserError)}`);
                    }
                }
                else {
                    throw new Error(`Jina和本地解析都失败了。Jina: ${jinaErr.message}, 本地: ${localErr.message}`);
                }
            }
        }
    }
    else {
        // When Jina is not preferred, start from the local parser
        try {
            return await fetchWithLocalParser(url);
        }
        catch (localError) {
            const localErr = localError instanceof Error ? localError : new Error(String(localError));
            if (shouldUseBrowser(localErr)) {
                console.error("本地解析失败,检测到访问限制,使用Playwright浏览器模式");
                return await fetchWithPlaywright(url);
            }
            else {
                throw localErr;
            }
        }
    }
}
// Handle tool list requests
server.setRequestHandler(ListToolsRequestSchema, async () => {
    return {
        tools: [
            {
                name: "fetch_url",
                description: "获取指定URL的网页内容,并转换为Markdown格式。默认使用Jina Reader,失败时自动切换到本地解析",
                inputSchema: {
                    type: "object",
                    properties: {
                        url: {
                            type: "string",
                            description: "要获取内容的网页URL(必须是http或https协议)",
                        },
                        preferJina: {
                            type: "boolean",
                            description: "是否优先使用Jina Reader(默认为true)",
                            default: true,
                        },
                    },
                    required: ["url"],
                },
            },
            {
                name: "fetch_multiple_urls",
                description: "批量获取多个URL的网页内容",
                inputSchema: {
                    type: "object",
                    properties: {
                        urls: {
                            type: "array",
                            items: {
                                type: "string",
                            },
                            description: "要获取内容的网页URL列表",
                            maxItems: 10, // at most 10 URLs
                        },
                        preferJina: {
                            type: "boolean",
                            description: "是否优先使用Jina Reader(默认为true)",
                            default: true,
                        },
                    },
                    required: ["urls"],
                },
            },
            {
                name: "fetch_url_with_jina",
                description: "强制使用Jina Reader获取网页内容(适用于复杂网页)",
                inputSchema: {
                    type: "object",
                    properties: {
                        url: {
                            type: "string",
                            description: "要获取内容的网页URL",
                        },
                    },
                    required: ["url"],
                },
            },
            {
                name: "fetch_url_local",
                description: "强制使用本地解析器获取网页内容(适用于简单网页或Jina不可用时)",
                inputSchema: {
                    type: "object",
                    properties: {
                        url: {
                            type: "string",
                            description: "要获取内容的网页URL",
                        },
                    },
                    required: ["url"],
                },
            },
            {
                name: "fetch_url_with_browser",
                description: "强制使用Playwright浏览器获取网页内容(适用于有访问限制的网站,如Cloudflare保护、验证码等)",
                inputSchema: {
                    type: "object",
                    properties: {
                        url: {
                            type: "string",
                            description: "要获取内容的网页URL",
                        },
                    },
                    required: ["url"],
                },
            },
        ],
    };
});
// Handle tool call requests
server.setRequestHandler(CallToolRequestSchema, async (request) => {
    const { name, arguments: args } = request.params;
    try {
        if (name === "fetch_url") {
            const { url, preferJina = true } = args;
            // Validate the URL
            if (!isValidUrl(url)) {
                throw new McpError(ErrorCode.InvalidParams, "无效的URL格式,请提供http或https协议的URL");
            }
            // Fetch the page content
            const result = await fetchWebContent(url, preferJina);
            return {
                content: [
                    {
                        type: "text",
                        text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: ${result.metadata.method}\n\n---\n\n${result.content}`,
                    },
                ],
            };
        }
        else if (name === "fetch_url_with_jina") {
            const { url } = args;
            if (!isValidUrl(url)) {
                throw new McpError(ErrorCode.InvalidParams, "无效的URL格式");
            }
            const result = await fetchWithJinaReader(url);
            return {
                content: [
                    {
                        type: "text",
                        text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: Jina Reader\n\n---\n\n${result.content}`,
                    },
                ],
            };
        }
        else if (name === "fetch_url_local") {
            const { url } = args;
            if (!isValidUrl(url)) {
                throw new McpError(ErrorCode.InvalidParams, "无效的URL格式");
            }
            const result = await fetchWithLocalParser(url);
            return {
                content: [
                    {
                        type: "text",
                        text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: 本地解析器\n\n---\n\n${result.content}`,
                    },
                ],
            };
        }
        else if (name === "fetch_multiple_urls") {
            const { urls, preferJina = true } = args;
            // Validate every URL
            const invalidUrls = urls.filter(url => !isValidUrl(url));
            if (invalidUrls.length > 0) {
                throw new McpError(ErrorCode.InvalidParams, `以下URL格式无效: ${invalidUrls.join(", ")}`);
            }
            // Fetch all URLs concurrently
            const results = await Promise.allSettled(urls.map(url => fetchWebContent(url, preferJina)));
            // Assemble the results
            let combinedContent = "# 批量URL内容获取结果\n\n";
            results.forEach((result, index) => {
                const url = urls[index];
                combinedContent += `## ${index + 1}. ${url}\n\n`;
                if (result.status === "fulfilled") {
                    const { title, content, metadata } = result.value;
                    combinedContent += `**标题**: ${title}\n`;
                    combinedContent += `**获取时间**: ${metadata.fetchedAt}\n`;
                    combinedContent += `**内容长度**: ${metadata.contentLength} 字符\n`;
                    combinedContent += `**解析方法**: ${metadata.method}\n\n`;
                    combinedContent += `### 内容\n\n${content}\n\n`;
                }
                else {
                    combinedContent += `**错误**: ${result.reason}\n\n`;
                }
                combinedContent += "---\n\n";
            });
            return {
                content: [
                    {
                        type: "text",
                        text: combinedContent,
                    },
                ],
            };
        }
        else if (name === "fetch_url_with_browser") {
            const { url } = args;
            if (!isValidUrl(url)) {
                throw new McpError(ErrorCode.InvalidParams, "无效的URL格式");
            }
            const result = await fetchWithPlaywright(url);
            return {
                content: [
                    {
                        type: "text",
                        text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: Playwright浏览器\n\n---\n\n${result.content}`,
                    },
                ],
            };
        }
        else {
            throw new McpError(ErrorCode.MethodNotFound, `未知的工具: ${name}`);
        }
    }
    catch (error) {
        if (error instanceof McpError) {
            throw error;
        }
        throw new McpError(ErrorCode.InternalError, `工具执行失败: ${error instanceof Error ? error.message : String(error)}`);
    }
});
// Start the server
async function main() {
    const transport = new StdioServerTransport();
    await server.connect(transport);
    console.error("MCP Web Reader v2.0 已启动(支持Jina Reader + Playwright)");
}
// Graceful shutdown handling
process.on('SIGINT', async () => {
    console.error("接收到SIGINT信号,正在关闭浏览器...");
    await closeBrowser();
    process.exit(0);
});
process.on('SIGTERM', async () => {
    console.error("接收到SIGTERM信号,正在关闭浏览器...");
    await closeBrowser();
    process.exit(0);
});
main().catch((error) => {
    console.error("服务器启动失败:", error);
    process.exit(1);
});
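The batch tool in `dist/index.js` relies on `Promise.allSettled` so that one failing URL does not discard the others. The pattern can be sketched in isolation as follows; `fetchAll` and the stub `fetchOne` are hypothetical names used only for illustration, and the stub stands in for the server's per-URL fetch.

```javascript
// Sketch of the batch pattern: run all fetches concurrently and report
// each outcome separately. `fetchOne` is a stub standing in for the
// server's per-URL fetch; `fetchAll` is a hypothetical name.
async function fetchAll(urls, fetchOne) {
  const settled = await Promise.allSettled(urls.map((url) => fetchOne(url)));
  // allSettled preserves input order, so index i belongs to urls[i].
  return settled.map((outcome, index) => ({
    url: urls[index],
    ok: outcome.status === "fulfilled",
    value: outcome.status === "fulfilled" ? outcome.value : undefined,
    error: outcome.status === "rejected" ? String(outcome.reason) : undefined,
  }));
}

const stub = async (url) => {
  if (url.includes("bad")) throw new Error("HTTP error! status: 404");
  return `content of ${url}`;
};

fetchAll(["https://example.com", "https://bad.example"], stub)
  .then((results) => results.forEach((r) => console.log(r.url, r.ok)));
```

Unlike `Promise.all`, which rejects as soon as any input rejects, `Promise.allSettled` always resolves with one outcome per input, which is what lets the report list successes and errors side by side.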
package/package.json
ADDED
@@ -0,0 +1,45 @@
{
  "name": "mcp-web-reader",
  "version": "2.0.0",
  "description": "MCP server for reading web content with Jina Reader and local parser support",
  "main": "dist/index.js",
  "bin": {
    "mcp-web-reader": "./dist/index.js"
  },
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "tsc --watch",
    "claude-code": "node dist/index.js"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/Gracker/mcp-web-reader.git"
  },
  "bugs": {
    "url": "https://github.com/Gracker/mcp-web-reader/issues"
  },
  "homepage": "https://github.com/Gracker/mcp-web-reader#readme",
  "keywords": ["mcp", "claude", "web-scraping", "jina-reader"],
  "author": "Gracker",
  "license": "MIT",
  "files": [
    "dist",
    "README.md",
    "LICENSE"
  ],
  "dependencies": {
    "@modelcontextprotocol/sdk": "^0.5.0",
    "node-fetch": "^3.3.2",
    "jsdom": "^24.0.0",
    "turndown": "^7.1.3",
    "playwright": "^1.40.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0",
    "@types/jsdom": "^21.1.6",
    "@types/turndown": "^5.0.4",
    "typescript": "^5.3.3"
  }
}