mcp-web-reader 2.0.0 → 2.1.0

Files changed (3)
  1. package/README.md +57 -203
  2. package/dist/index.js +115 -115
  3. package/package.json +3 -2
package/README.md CHANGED
@@ -1,255 +1,109 @@
1
1
  # MCP Web Reader
2
2
 
3
- 一个强大的 MCP (Model Context Protocol) 服务器,让 Claude 和其他大语言模型能够读取和解析网页内容。支持突破访问限制,轻松获取微信文章、时代杂志等受保护内容。
3
+ A powerful MCP (Model Context Protocol) server that enables Claude and other LLMs to read and parse web content. Bypasses access restrictions for WeChat articles, paywalled sites, and Cloudflare-protected pages.
4
4
 
5
- ## 功能特点
5
+ [简体中文](./README_CN.md)
6
6
 
7
- - 🚀 **三引擎支持**:集成 Jina Reader API、本地解析器和 Playwright 浏览器
8
- - 🔄 **智能降级**:Jina Reader → 本地解析 → Playwright 浏览器三层自动切换
9
- - 🌐 **突破限制**:使用 Playwright 处理 Cloudflare、验证码等访问限制
10
- - 📦 **批量处理**:支持同时获取多个 URL
11
- - 🎯 **灵活控制**:可选择强制使用特定解析方式
12
- - 📝 **Markdown 输出**:自动转换为清晰的 Markdown 格式
7
+ ## Features
13
8
 
14
- ## 安装
9
+ - 🚀 **Multi-engine**: Jina Reader API, local parser, and Playwright browser
10
+ - 🔄 **Smart fallback**: Auto-switches Jina → Local → Playwright browser
11
+ - 🌐 **Bypass restrictions**: Cloudflare, CAPTCHAs, access controls
12
+ - 📦 **Batch processing**: Fetch multiple URLs simultaneously
13
+ - 📝 **Markdown output**: Automatic conversion to clean Markdown
15
14
 
16
- ### 方法 1:从源码安装
15
+ ## Installation
17
16
 
18
17
  ```bash
19
- # 克隆仓库
20
- git clone https://github.com/zacfire/mcp-web-reader.git
21
- cd mcp-web-reader
22
-
23
- # 安装依赖
24
- npm install
25
-
26
- # 构建项目
27
- npm run build
28
-
29
- # 安装 Playwright 浏览器(必需)
30
- npx playwright install chromium
18
+ npm install -g mcp-web-reader
31
19
  ```
32
20
 
33
- ### 方法 2:使用 npm 安装(推荐)
21
+ > **Note**: Chromium browser (~100-200MB) will be automatically downloaded. This is required for:
22
+ > - WeChat articles (need browser rendering)
23
+ > - Cloudflare-protected sites
24
+ > - JavaScript-heavy sites
25
+ > - CAPTCHA/access restrictions
34
26
 
35
- 发布后,您可以简单地通过 npm 安装:
27
+ Download may take 1-5 minutes depending on network speed.
36
28
 
37
- ```bash
38
- npm install -g mcp-web-reader
39
- ```
40
-
41
- **首次发布步骤**:
42
- 如果这是第一次发布,请运行提供的发布脚本:
29
+ ### From Source
43
30
 
44
31
  ```bash
45
- # 确保已登录 npm
46
- npm login
47
-
48
- # 运行发布脚本
49
- ./publish.sh
32
+ git clone https://github.com/Gracker/mcp-web-reader.git
33
+ cd mcp-web-reader
34
+ npm install
35
+ npm run build
50
36
  ```
51
37
 
52
- ## 配置
38
+ ## Configuration
53
39
 
54
- ### 快速配置
40
+ ### Claude Desktop
55
41
 
56
- 在 Claude Desktop 的配置文件中添加:
42
+ Add to your config file:
57
43
 
58
44
  **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
59
-
60
45
  **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
61
46
 
62
47
  ```json
63
48
  {
64
49
  "mcpServers": {
65
50
  "web-reader": {
66
- "command": "node",
67
- "args": ["/absolute/path/to/mcp-web-reader/dist/index.js"]
51
+ "command": "mcp-web-reader"
68
52
  }
69
53
  }
70
54
  }
71
55
  ```
72
56
 
73
- **重要**: `/absolute/path/to/mcp-web-reader/dist/index.js` 替换为你的实际路径。
74
-
75
- ### 详细配置指南
76
-
77
- 📖 **完整的使用指南**:请查看 [USAGE_GUIDE.md](./USAGE_GUIDE.md),包含:
78
- - **命令行使用(推荐)** - 适合使用 CLI 的用户
79
- - Claude Desktop 配置
80
- - Claude Code (Cursor) 配置
81
- - 其他 MCP 客户端配置
82
- - 使用示例和故障排除
83
-
84
- 📖 **命令行专用指南**:请查看 [CLI_USAGE.md](./CLI_USAGE.md),包含:
85
- - 使用 MCP Inspector 测试
86
- - 创建 CLI 包装器
87
- - 集成到自定义脚本
88
- - 命令行工具使用示例
89
-
90
- ## 使用方法
91
-
92
- ### 命令行使用(推荐)
93
-
94
- 如果你主要使用命令行的 Claude,可以使用提供的 CLI 工具:
95
-
96
- ```bash
97
- # 智能获取(自动降级)
98
- node cli.js fetch https://example.com
99
-
100
- # 强制使用 Jina Reader
101
- node cli.js jina https://example.com
102
-
103
- # 强制使用本地解析
104
- node cli.js local https://example.com
105
-
106
- # 强制使用浏览器模式(适用于微信文章等受限网站)
107
- node cli.js browser https://mp.weixin.qq.com/...
108
- ```
109
-
110
- 或者使用 MCP Inspector 进行交互式测试:
57
+ ### Claude Code
111
58
 
112
59
  ```bash
113
- npx @modelcontextprotocol/inspector node dist/index.js
60
+ claude mcp add web-reader -- mcp-web-reader
61
+ claude mcp list
114
62
  ```
115
63
 
116
- 📖 详细说明请查看 [CLI_USAGE.md](./CLI_USAGE.md)
117
-
118
- ### 在 Claude 中使用
119
-
120
- 配置完成后,在 Claude 中可以使用以下命令:
121
-
122
- 1. **智能获取(推荐)**
123
- * "请获取 [https://example.com](https://example.com) 的内容"
124
- * 自动三层降级:Jina Reader → 本地解析 → Playwright 浏览器
125
-
126
- 2. **批量获取**
127
- * "请获取这些网页:[url1, url2, url3]"
128
- * 每个URL都享受智能降级策略
64
+ ## Usage
129
65
 
130
- 3. **强制使用 Jina Reader**
131
- * "使用 Jina Reader 获取 [https://example.com](https://example.com)"
66
+ In Claude:
67
+ - "Fetch content from https://example.com"
68
+ - "Get content using browser for https://mp.weixin.qq.com/..."
69
+ - "Fetch multiple URLs: [url1, url2, url3]"
132
70
 
133
- 4. **强制使用本地解析**
134
- * "使用本地解析器获取 [https://example.com](https://example.com)"
71
+ ## Supported Sites
135
72
 
136
- 5. **强制使用浏览器模式**
137
- * "使用浏览器获取 [https://example.com](https://example.com)"
138
- * 直接跳过其他方式,适用于确定有访问限制的网站
73
+ - WeChat articles (mp.weixin.qq.com)
74
+ - Paywalled sites (NYT, Time Magazine, etc.)
75
+ - Cloudflare-protected sites
76
+ - JavaScript-heavy sites
77
+ - CAPTCHA-protected sites
139
78
 
140
- ## 支持的受限网站类型
79
+ ## Tools
141
80
 
142
- **微信公众号文章** - 自动绕过访问限制
143
- **时代杂志、纽约时报** - 突破付费墙和地区限制
144
- **Cloudflare 保护网站** - 通过真实浏览器绕过检测
145
- **需要 JavaScript 渲染的页面** - 完整执行页面脚本
146
- **有验证码/人机验证的网站** - 模拟真实用户行为
81
+ - `fetch_url` - Smart fetching with automatic fallback
82
+ - `fetch_url_with_jina` - Force Jina Reader
83
+ - `fetch_url_local` - Force local parsing
84
+ - `fetch_url_with_browser` - Force browser mode (for restricted sites)
85
+ - `fetch_multiple_urls` - Batch URL fetching
147
86
 
148
- ## 工具列表
87
+ ## Architecture
149
88
 
150
- - `fetch_url` - 智能获取(三层降级:Jina → 本地 → Playwright)
151
- - `fetch_url_with_jina` - 强制使用 Jina Reader
152
- - `fetch_url_local` - 强制使用本地解析器
153
- - `fetch_url_with_browser` - 强制使用 Playwright 浏览器(突破访问限制)
154
- - `fetch_multiple_urls` - 批量获取多个 URL
155
-
156
- ## 技术架构
157
-
158
- ### 智能降级策略
159
- ```
160
- 用户请求 URL
161
-
162
- 1. Jina Reader API (最快,成功率高)
163
- ↓ 失败
164
- 2. 本地解析器 (Node.js + JSDOM)
165
- ↓ 检测到访问限制
166
- 3. Playwright 浏览器 (真实浏览器,突破限制)
167
- ```
168
-
169
- ### 访问限制检测
170
- 自动识别以下情况并启用浏览器模式:
171
- - HTTP 状态码:403, 429, 503, 520-524
172
- - 错误关键词:Cloudflare, CAPTCHA, Access Denied, Rate Limit
173
- - 内容关键词:Security Check, Human Verification
174
-
175
- ## 开发
176
-
177
- ```bash
178
- # 开发模式(自动重新编译)
179
- npm run dev
180
-
181
- # 构建
182
- npm run build
183
-
184
- # 测试运行
185
- npm start
186
-
187
- # 安装浏览器二进制文件(首次使用必需)
188
- npx playwright install chromium
189
- ```
190
-
191
- ## 性能优化
192
-
193
- - ⚡ **浏览器实例复用** - 避免重复启动开销
194
- - 🚫 **资源过滤** - 阻止图片、样式表等不必要加载
195
- - 🎯 **智能选择** - 优先使用快速方法,必要时才用浏览器
196
- - 💾 **优雅关闭** - 正确清理浏览器资源
197
-
198
- ## 验证安装
199
-
200
- ### 测试 MCP 服务器
201
-
202
- 1. **使用 MCP Inspector 测试**:
203
- ```bash
204
- npx @modelcontextprotocol/inspector node dist/index.js
89
+ Intelligent fallback:
205
90
  ```
206
-
207
- 2. **测试工具功能**:
208
- 在 Inspector 中输入以下 JSON 测试各种工具:
209
- ```json
210
- {"method": "tools/call", "params": {"name": "fetch_url", "arguments": {"url": "https://example.com"}}}
91
+ URL Request → Jina Reader → Local Parser → Playwright Browser
211
92
  ```
212
93
 
213
- ### Claude Desktop 中验证
214
-
215
- 配置完成后,重启 Claude Desktop,然后在对话中输入:
216
- - "请获取 https://httpbin.org/json 的内容"
217
-
218
- 如果能成功返回内容,说明安装成功。
219
-
220
- ## 故障排除
94
+ Auto-detects restrictions and switches to browser for:
95
+ - HTTP status codes: 403, 429, 503, 520-524
96
+ - Keywords: Cloudflare, CAPTCHA, Access Denied
97
+ - Content patterns: Security checks, human verification
221
98
 
222
- ### 常见问题
99
+ ## Development
223
100
 
224
- 1. **"找不到模块" 错误**
225
- - 确保已运行 `npm install`
226
- - 确保已运行 `npm run build`
227
-
228
- 2. **Claude Desktop 无法连接到 MCP 服务器**
229
- - 检查配置文件路径是否正确
230
- - 检查 `dist/index.js` 路径是否正确
231
- - 重启 Claude Desktop
232
-
233
- 3. **Playwright 浏览器相关错误**
234
- - 确保已运行 `npx playwright install chromium`
235
- - 检查系统是否支持图形界面(某些服务器环境可能需要额外配置)
236
-
237
- 4. **微信文章无法获取**
238
- - 微信文章需要 Playwright 浏览器模式
239
- - 使用 `fetch_url_with_browser` 工具强制使用浏览器
240
-
241
- ### 调试模式
242
-
243
- 启用详细日志:
244
101
  ```bash
245
- DEBUG=* node dist/index.js
102
+ npm run dev # Development with auto-rebuild
103
+ npm run build # Build production version
104
+ npm start # Test run
246
105
  ```
247
106
 
248
- ## 贡献
249
-
250
- 欢迎提交 Pull Request!
251
-
252
- ## 许可证
107
+ ## License
253
108
 
254
109
  MIT License
255
-
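The fallback order in the README's Architecture section (Jina Reader → Local Parser → Playwright Browser) can be sketched roughly as below. This is a minimal illustration, not the package's code: `fetchWithFallback`, `looksRestricted`, `toError`, the `tiers` object, and the `statusCode` property on errors are all hypothetical names. Only the status codes and keywords come from the README.

```javascript
// Sketch only: the names below are hypothetical and not taken from the package.

// Status codes and keywords the README lists as signs of an access restriction.
const RESTRICTED_STATUS = new Set([403, 429, 503, 520, 521, 522, 523, 524]);
const RESTRICTED_KEYWORDS = ["cloudflare", "captcha", "access denied"];

// Normalize anything thrown into an Error so `.message` is always available.
function toError(thrown) {
  return thrown instanceof Error ? thrown : new Error(String(thrown));
}

// True when a failure looks like a block that a real browser might get past.
// `error.statusCode` is an assumed property that a fetcher could attach.
function looksRestricted(error) {
  if (RESTRICTED_STATUS.has(error.statusCode)) return true;
  const message = error.message.toLowerCase();
  return RESTRICTED_KEYWORDS.some((keyword) => message.includes(keyword));
}

// `tiers` is assumed to be { jina, local, browser }, each an async (url) => result.
async function fetchWithFallback(url, tiers) {
  try {
    return await tiers.jina(url); // Tier 1
  } catch (thrownByJina) {
    const jinaError = toError(thrownByJina);
    try {
      return await tiers.local(url); // Tier 2
    } catch (thrownByLocal) {
      const localError = toError(thrownByLocal);
      // Tier 3 runs only when a failure looks like an access restriction.
      if (looksRestricted(jinaError) || looksRestricted(localError)) {
        return await tiers.browser(url);
      }
      throw new Error(
        `Jina and local parser both failed. Jina: ${jinaError.message}, Local: ${localError.message}`
      );
    }
  }
}
```

Passing the fetchers in as arguments keeps the control flow testable without network access or a browser launch.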
package/dist/index.js CHANGED
@@ -5,7 +5,7 @@ import fetch from "node-fetch";
5
5
  import { JSDOM } from "jsdom";
6
6
  import TurndownService from "turndown";
7
7
  import { chromium } from "playwright";
8
- // 创建服务器实例
8
+ // Create server instance
9
9
  const server = new Server({
10
10
  name: "web-reader",
11
11
  version: "2.0.0",
@@ -14,19 +14,19 @@ const server = new Server({
14
14
  tools: {},
15
15
  },
16
16
  });
17
- // 初始化Turndown服务(将HTML转换为Markdown)
17
+ // Initialize Turndown service (convert HTML to Markdown)
18
18
  const turndownService = new TurndownService({
19
19
  headingStyle: "atx",
20
20
  codeBlockStyle: "fenced",
21
21
  });
22
- // 配置Turndown规则
22
+ // Configure Turndown rules
23
23
  turndownService.addRule("skipScripts", {
24
24
  filter: ["script", "style", "noscript"],
25
25
  replacement: () => "",
26
26
  });
27
- // 浏览器实例管理
27
+ // Browser instance management
28
28
  let browser = null;
29
- // 获取或创建浏览器实例
29
+ // Get or create browser instance
30
30
  async function getBrowser() {
31
31
  if (!browser) {
32
32
  browser = await chromium.launch({
@@ -34,7 +34,7 @@ async function getBrowser() {
34
34
  args: [
35
35
  '--no-sandbox',
36
36
  '--disable-dev-shm-usage',
37
- '--disable-blink-features=AutomationControlled', // 禁用自动化检测
37
+ '--disable-blink-features=AutomationControlled', // Disable automation detection
38
38
  '--disable-infobars',
39
39
  '--window-size=1920,1080',
40
40
  '--start-maximized',
@@ -43,14 +43,14 @@ async function getBrowser() {
43
43
  }
44
44
  return browser;
45
45
  }
46
- // 清理浏览器实例
46
+ // Clean up browser instance
47
47
  async function closeBrowser() {
48
48
  if (browser) {
49
49
  await browser.close();
50
50
  browser = null;
51
51
  }
52
52
  }
53
- // URL验证函数
53
+ // URL validation function
54
54
  function isValidUrl(urlString) {
55
55
  try {
56
56
  const url = new URL(urlString);
@@ -60,18 +60,18 @@ function isValidUrl(urlString) {
60
60
  return false;
61
61
  }
62
62
  }
63
- // 检测是否是微信文章链接
63
+ // Check if it's a WeChat article link
64
64
  function isWeixinUrl(url) {
65
65
  return url.includes('mp.weixin.qq.com') || url.includes('weixin.qq.com');
66
66
  }
67
- // 检测是否需要使用浏览器模式
67
+ // Check if browser mode is needed
68
68
  function shouldUseBrowser(error, statusCode, content) {
69
69
  const errorMessage = error.message.toLowerCase();
70
- // 基于HTTP状态码判断
70
+ // Based on HTTP status codes
71
71
  if (statusCode && [403, 429, 503, 520, 521, 522, 523, 524].includes(statusCode)) {
72
72
  return true;
73
73
  }
74
- // 基于错误消息判断
74
+ // Based on error messages
75
75
  const browserTriggers = [
76
76
  'cloudflare',
77
77
  'access denied',
@@ -83,13 +83,13 @@ function shouldUseBrowser(error, statusCode, content) {
83
83
  'blocked',
84
84
  'protection',
85
85
  'verification required',
86
- '环境异常',
87
- '验证'
86
+ 'environment anomaly',
87
+ 'verify'
88
88
  ];
89
89
  if (browserTriggers.some(trigger => errorMessage.includes(trigger))) {
90
90
  return true;
91
91
  }
92
- // 基于响应内容判断
92
+ // Based on response content
93
93
  if (content) {
94
94
  const contentLower = content.toLowerCase();
95
95
  const contentTriggers = [
@@ -99,10 +99,10 @@ function shouldUseBrowser(error, statusCode, content) {
99
99
  'security check',
100
100
  'human verification',
101
101
  'captcha',
102
- // 微信特有的验证关键词
103
- '环境异常',
104
- '去验证',
105
- '完成验证后即可继续访问',
102
+ // WeChat-specific verification keywords
103
+ 'environment anomaly',
104
+ 'verify',
105
+ 'complete verification to continue',
106
106
  'verify'
107
107
  ];
108
108
  if (contentTriggers.some(trigger => contentLower.includes(trigger))) {
@@ -111,12 +111,12 @@ function shouldUseBrowser(error, statusCode, content) {
111
111
  }
112
112
  return false;
113
113
  }
114
- // 使用Jina Reader获取内容
114
+ // Fetch content using Jina Reader
115
115
  async function fetchWithJinaReader(url) {
116
116
  try {
117
117
  // Jina Reader API URL
118
118
  const jinaUrl = `https://r.jina.ai/${url}`;
119
- // 创建超时控制器
119
+ // Create timeout controller
120
120
  const controller = new AbortController();
121
121
  const timeoutId = setTimeout(() => controller.abort(), 30000);
122
122
  const response = await fetch(jinaUrl, {
@@ -131,9 +131,9 @@ async function fetchWithJinaReader(url) {
131
131
  throw new Error(`Jina Reader API error! status: ${response.status}`);
132
132
  }
133
133
  const markdown = await response.text();
134
- // 从Markdown中提取标题(通常是第一个#标题)
134
+ // Extract title from Markdown (usually the first # heading)
135
135
  const titleMatch = markdown.match(/^#\s+(.+)$/m);
136
- const title = titleMatch ? titleMatch[1] : "无标题";
136
+ const title = titleMatch ? titleMatch[1] : "No title";
137
137
  return {
138
138
  title,
139
139
  content: markdown,
@@ -148,21 +148,21 @@ async function fetchWithJinaReader(url) {
148
148
  catch (error) {
149
149
  if (error instanceof Error) {
150
150
  if (error.name === 'AbortError') {
151
- throw new Error(`Jina Reader请求超时(30秒)`);
151
+ throw new Error(`Jina Reader request timeout (30s)`);
152
152
  }
153
- throw new Error(`Jina Reader获取失败: ${error.message}`);
153
+ throw new Error(`Jina Reader fetch failed: ${error.message}`);
154
154
  }
155
- throw new Error(`Jina Reader获取失败: ${String(error)}`);
155
+ throw new Error(`Jina Reader fetch failed: ${String(error)}`);
156
156
  }
157
157
  }
158
- // 使用Playwright获取网页内容
158
+ // Fetch web content using Playwright
159
159
  async function fetchWithPlaywright(url) {
160
160
  let page = null;
161
161
  const isWeixin = isWeixinUrl(url);
162
162
  try {
163
163
  const browserInstance = await getBrowser();
164
164
  page = await browserInstance.newPage();
165
- // 设置真实的 User-Agent(模拟 Chrome on Mac)
165
+ // Set real User-Agent (simulate Chrome on Mac)
166
166
  const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';
167
167
  await page.setExtraHTTPHeaders({
168
168
  'User-Agent': userAgent,
@@ -174,7 +174,7 @@ async function fetchWithPlaywright(url) {
174
174
  ...(isWeixin ? { 'Referer': 'https://mp.weixin.qq.com/' } : {}),
175
175
  });
176
176
  await page.setViewportSize({ width: 1920, height: 1080 });
177
- // 微信文章需要加载样式以正确渲染,其他网站可以过滤
177
+ // WeChat articles need to load styles for correct rendering, filter for other sites
178
178
  if (!isWeixin) {
179
179
  await page.route('**/*', (route) => {
180
180
  const resourceType = route.request().resourceType();
@@ -186,30 +186,30 @@ async function fetchWithPlaywright(url) {
186
186
  }
187
187
  });
188
188
  }
189
- // 导航到页面,设置更长的超时时间
189
+ // Navigate to page with longer timeout
190
190
  await page.goto(url, {
191
191
  timeout: 45000,
192
- waitUntil: 'networkidle' // 等待网络空闲,确保 JS 完全执行
192
+ waitUntil: 'networkidle' // Wait for network idle to ensure JS execution
193
193
  });
194
- // 微信文章需要更长的等待时间
194
+ // WeChat articles need longer wait time
195
195
  const waitTime = isWeixin ? 5000 : 2000;
196
196
  await page.waitForTimeout(waitTime);
197
- // 获取页面标题
198
- const title = await page.title() || "无标题";
199
- // 移除不需要的元素
197
+ // Get page title
198
+ const title = await page.title() || "No title";
199
+ // Remove unwanted elements
200
200
  await page.evaluate(() => {
201
201
  const elementsToRemove = document.querySelectorAll('script, style, nav, header, footer, aside, .advertisement, .ads, .sidebar, .comments, .social-share');
202
202
  elementsToRemove.forEach(el => el.remove());
203
203
  });
204
- // 获取主要内容(微信文章有特定的 DOM 结构)
204
+ // Get main content (WeChat articles have specific DOM structure)
205
205
  const htmlContent = await page.evaluate(() => {
206
- // 微信文章特定选择器
206
+ // WeChat article specific selectors
207
207
  const weixinContent = document.querySelector('#js_content') ||
208
208
  document.querySelector('.rich_media_content');
209
209
  if (weixinContent) {
210
210
  return weixinContent.innerHTML;
211
211
  }
212
- // 通用选择器
212
+ // Common selectors
213
213
  const mainContent = document.querySelector('main') ||
214
214
  document.querySelector('article') ||
215
215
  document.querySelector('[role="main"]') ||
@@ -220,9 +220,9 @@ async function fetchWithPlaywright(url) {
220
220
  document.body;
221
221
  return mainContent ? mainContent.innerHTML : document.body.innerHTML;
222
222
  });
223
- // 转换为Markdown
223
+ // Convert to Markdown
224
224
  const markdown = turndownService.turndown(htmlContent);
225
- // 清理内容
225
+ // Clean content
226
226
  const cleanedContent = markdown
227
227
  .replace(/\n{3,}/g, "\n\n")
228
228
  .replace(/^\s+$/gm, "")
@@ -240,9 +240,9 @@ async function fetchWithPlaywright(url) {
240
240
  }
241
241
  catch (error) {
242
242
  if (error instanceof Error) {
243
- throw new Error(`Playwright获取失败: ${error.message}`);
243
+ throw new Error(`Playwright fetch failed: ${error.message}`);
244
244
  }
245
- throw new Error(`Playwright获取失败: ${String(error)}`);
245
+ throw new Error(`Playwright fetch failed: ${String(error)}`);
246
246
  }
247
247
  finally {
248
248
  if (page) {
@@ -250,13 +250,13 @@ async function fetchWithPlaywright(url) {
250
250
  }
251
251
  }
252
252
  }
253
- // 本地提取网页内容的函数
253
+ // Local web content extraction function
254
254
  async function fetchWithLocalParser(url) {
255
255
  try {
256
- // 创建超时控制器
256
+ // Create timeout controller
257
257
  const controller = new AbortController();
258
258
  const timeoutId = setTimeout(() => controller.abort(), 30000);
259
- // 发送HTTP请求
259
+ // Send HTTP request
260
260
  const response = await fetch(url, {
261
261
  headers: {
262
262
  "User-Agent": "Mozilla/5.0 (compatible; MCP-URLFetcher/2.0)",
@@ -267,17 +267,17 @@ async function fetchWithLocalParser(url) {
267
267
  if (!response.ok) {
268
268
  throw new Error(`HTTP error! status: ${response.status}`);
269
269
  }
270
- // 获取HTML内容
270
+ // Get HTML content
271
271
  const html = await response.text();
272
- // 使用JSDOM解析HTML
272
+ // Parse HTML with JSDOM
273
273
  const dom = new JSDOM(html);
274
274
  const document = dom.window.document;
275
- // 获取标题
276
- const title = document.querySelector("title")?.textContent || "无标题";
277
- // 移除不需要的元素
275
+ // Get title
276
+ const title = document.querySelector("title")?.textContent || "No title";
277
+ // Remove unwanted elements
278
278
  const elementsToRemove = document.querySelectorAll("script, style, nav, header, footer, aside, .advertisement, .ads, .sidebar, .comments");
279
279
  elementsToRemove.forEach(el => el.remove());
280
- // 获取主要内容区域
280
+ // Get main content area
281
281
  const mainContent = document.querySelector("main") ||
282
282
  document.querySelector("article") ||
283
283
  document.querySelector('[role="main"]') ||
@@ -286,9 +286,9 @@ async function fetchWithLocalParser(url) {
286
286
  document.querySelector(".post") ||
287
287
  document.querySelector(".entry-content") ||
288
288
  document.body;
289
- // 转换为Markdown
289
+ // Convert to Markdown
290
290
  const markdown = turndownService.turndown(mainContent.innerHTML);
291
- // 清理多余的空行和空格
291
+ // Clean extra whitespace
292
292
  const cleanedContent = markdown
293
293
  .replace(/\n{3,}/g, "\n\n")
294
294
  .replace(/^\s+$/gm, "")
@@ -307,62 +307,62 @@ async function fetchWithLocalParser(url) {
307
307
  catch (error) {
308
308
  if (error instanceof Error) {
309
309
  if (error.name === 'AbortError') {
310
- throw new Error(`本地解析请求超时(30秒)`);
310
+ throw new Error(`Local parser request timeout (30s)`);
311
311
  }
312
- throw new Error(`本地解析失败: ${error.message}`);
312
+ throw new Error(`Local parser failed: ${error.message}`);
313
313
  }
314
- throw new Error(`本地解析失败: ${String(error)}`);
314
+ throw new Error(`Local parser failed: ${String(error)}`);
315
315
  }
316
316
  }
317
- // 智能获取网页内容(三层降级策略:Jina → 本地 → Playwright)
318
- // 对于微信等已知需要浏览器的网站,直接使用浏览器模式
317
+ // Smart web content fetching (three-tier fallback: Jina → Local → Playwright)
318
+ // For known sites requiring browser (like WeChat), use browser mode directly
319
319
  async function fetchWebContent(url, preferJina = true) {
320
- // 微信文章直接使用浏览器模式,因为其他方式无法绕过验证
320
+ // WeChat articles use browser mode directly as other methods cannot bypass verification
321
321
  if (isWeixinUrl(url)) {
322
- console.error("检测到微信文章,直接使用Playwright浏览器模式");
322
+ console.error("Detected WeChat article, using Playwright browser mode");
323
323
  return await fetchWithPlaywright(url);
324
324
  }
325
325
  if (preferJina) {
326
- // 第一层:尝试Jina Reader
326
+ // Tier 1: Try Jina Reader
327
327
  try {
328
328
  return await fetchWithJinaReader(url);
329
329
  }
330
330
  catch (jinaError) {
331
- console.error("Jina Reader失败,尝试本地解析:", jinaError instanceof Error ? jinaError.message : String(jinaError));
332
- // 第二层:尝试本地解析
331
+ console.error("Jina Reader failed, trying local parser:", jinaError instanceof Error ? jinaError.message : String(jinaError));
332
+ // Tier 2: Try local parser
333
333
  try {
334
334
  return await fetchWithLocalParser(url);
335
335
  }
336
336
  catch (localError) {
337
- console.error("本地解析失败,检查是否需要浏览器模式:", localError instanceof Error ? localError.message : String(localError));
338
- // 判断是否需要使用浏览器模式
337
+ console.error("Local parser failed, checking if browser mode needed:", localError instanceof Error ? localError.message : String(localError));
338
+ // Check if browser mode is needed
339
339
  const jinaErr = jinaError instanceof Error ? jinaError : new Error(String(jinaError));
340
340
  const localErr = localError instanceof Error ? localError : new Error(String(localError));
341
341
  if (shouldUseBrowser(jinaErr) || shouldUseBrowser(localErr)) {
342
- console.error("检测到访问限制,使用Playwright浏览器模式");
342
+ console.error("Detected access restrictions, using Playwright browser mode");
343
343
  try {
344
- // 第三层:使用Playwright浏览器
344
+ // Tier 3: Use Playwright browser
345
345
  return await fetchWithPlaywright(url);
346
346
  }
347
347
  catch (browserError) {
348
- throw new Error(`所有方法都失败了。Jina: ${jinaErr.message}, 本地: ${localErr.message}, 浏览器: ${browserError instanceof Error ? browserError.message : String(browserError)}`);
348
+ throw new Error(`All methods failed. Jina: ${jinaErr.message}, Local: ${localErr.message}, Browser: ${browserError instanceof Error ? browserError.message : String(browserError)}`);
349
349
  }
350
350
  }
351
351
  else {
352
- throw new Error(`Jina和本地解析都失败了。Jina: ${jinaErr.message}, 本地: ${localErr.message}`);
352
+ throw new Error(`Jina and local parser both failed. Jina: ${jinaErr.message}, Local: ${localErr.message}`);
353
353
  }
354
354
  }
355
355
  }
356
356
  }
357
357
  else {
358
- // 如果不优先使用Jina,直接从本地解析开始
358
+ // If not prioritizing Jina, start with local parser
359
359
  try {
360
360
  return await fetchWithLocalParser(url);
361
361
  }
362
362
  catch (localError) {
363
363
  const localErr = localError instanceof Error ? localError : new Error(String(localError));
364
364
  if (shouldUseBrowser(localErr)) {
365
- console.error("本地解析失败,检测到访问限制,使用Playwright浏览器模式");
365
+ console.error("Local parser failed, detected access restrictions, using Playwright browser mode");
366
366
  return await fetchWithPlaywright(url);
367
367
  }
368
368
  else {
@@ -371,23 +371,23 @@ async function fetchWebContent(url, preferJina = true) {
371
371
  }
372
372
  }
373
373
  }
374
- // 处理工具列表请求
374
+ // Handle tool list requests
375
375
  server.setRequestHandler(ListToolsRequestSchema, async () => {
376
376
  return {
377
377
  tools: [
378
378
  {
379
379
  name: "fetch_url",
380
- description: "获取指定URL的网页内容,并转换为Markdown格式。默认使用Jina Reader,失败时自动切换到本地解析",
380
+ description: "Fetch web content from specified URL and convert to Markdown format. Uses Jina Reader by default, automatically falls back to local parser on failure",
381
381
  inputSchema: {
382
382
  type: "object",
383
383
  properties: {
384
384
  url: {
385
385
  type: "string",
386
- description: "要获取内容的网页URL(必须是http或https协议)",
386
+ description: "Webpage URL to fetch (must be http or https protocol)",
387
387
  },
388
388
  preferJina: {
389
389
  type: "boolean",
390
- description: "是否优先使用Jina Reader(默认为true)",
390
+ description: "Whether to prioritize Jina Reader (default: true)",
391
391
  default: true,
392
392
  },
393
393
  },
@@ -396,7 +396,7 @@ server.setRequestHandler(ListToolsRequestSchema, async () => {
396
396
  },
397
397
  {
398
398
  name: "fetch_multiple_urls",
399
- description: "批量获取多个URL的网页内容",
399
+ description: "Batch fetch web content from multiple URLs",
400
400
  inputSchema: {
401
401
  type: "object",
402
402
  properties: {
@@ -405,12 +405,12 @@ server.setRequestHandler(ListToolsRequestSchema, async () => {
405
405
  items: {
406
406
  type: "string",
407
407
  },
408
- description: "要获取内容的网页URL列表",
409
- maxItems: 10, // 限制最多10个URL
408
+ description: "List of webpage URLs to fetch",
409
+ maxItems: 10, // Limit to 10 URLs
410
410
  },
411
411
  preferJina: {
412
412
  type: "boolean",
413
- description: "是否优先使用Jina Reader(默认为true)",
413
+ description: "Whether to prioritize Jina Reader (default: true)",
414
414
  default: true,
415
415
  },
416
416
  },
@@ -419,13 +419,13 @@ server.setRequestHandler(ListToolsRequestSchema, async () => {
419
419
  },
420
420
  {
421
421
  name: "fetch_url_with_jina",
422
- description: "强制使用Jina Reader获取网页内容(适用于复杂网页)",
422
+ description: "Force fetch using Jina Reader (suitable for complex webpages)",
423
423
  inputSchema: {
424
424
  type: "object",
425
425
  properties: {
426
426
  url: {
427
427
  type: "string",
428
- description: "要获取内容的网页URL",
428
+ description: "Webpage URL to fetch",
429
429
  },
430
430
  },
431
431
  required: ["url"],
@@ -433,13 +433,13 @@ server.setRequestHandler(ListToolsRequestSchema, async () => {
433
433
  },
434
434
  {
435
435
  name: "fetch_url_local",
436
- description: "强制使用本地解析器获取网页内容(适用于简单网页或Jina不可用时)",
436
+ description: "Force fetch using local parser (suitable for simple webpages or when Jina is unavailable)",
437
437
  inputSchema: {
438
438
  type: "object",
439
439
  properties: {
440
440
  url: {
441
441
  type: "string",
442
- description: "要获取内容的网页URL",
442
+ description: "Webpage URL to fetch",
443
443
  },
444
444
  },
445
445
  required: ["url"],
@@ -447,13 +447,13 @@ server.setRequestHandler(ListToolsRequestSchema, async () => {
447
447
  },
448
448
  {
449
449
  name: "fetch_url_with_browser",
450
- description: "强制使用Playwright浏览器获取网页内容(适用于有访问限制的网站,如Cloudflare保护、验证码等)",
450
+ description: "Force fetch using Playwright browser (suitable for websites with access restrictions, such as Cloudflare protection, CAPTCHA, etc.)",
451
451
  inputSchema: {
452
452
  type: "object",
453
453
  properties: {
454
454
  url: {
455
455
  type: "string",
456
- description: "要获取内容的网页URL",
456
+ description: "Webpage URL to fetch",
457
457
  },
458
458
  },
459
459
  required: ["url"],
@@ -462,23 +462,23 @@ server.setRequestHandler(ListToolsRequestSchema, async () => {
462
462
  ],
463
463
  };
464
464
  });
465
- // 处理工具调用请求
465
+ // Handle tool call requests
466
466
  server.setRequestHandler(CallToolRequestSchema, async (request) => {
467
467
  const { name, arguments: args } = request.params;
468
468
  try {
469
469
  if (name === "fetch_url") {
470
470
  const { url, preferJina = true } = args;
471
- // 验证URL
471
+ // Validate URL
472
472
  if (!isValidUrl(url)) {
473
- throw new McpError(ErrorCode.InvalidParams, "无效的URL格式,请提供http或https协议的URL");
473
+ throw new McpError(ErrorCode.InvalidParams, "Invalid URL format, please provide http or https protocol URL");
474
474
  }
475
- // 获取网页内容
475
+ // Fetch web content
476
476
  const result = await fetchWebContent(url, preferJina);
477
477
  return {
478
478
  content: [
479
479
  {
480
480
  type: "text",
481
- text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: ${result.metadata.method}\n\n---\n\n${result.content}`,
481
+ text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**Fetched At**: ${result.metadata.fetchedAt}\n**Content Length**: ${result.metadata.contentLength} characters\n**Method**: ${result.metadata.method}\n\n---\n\n${result.content}`,
482
482
  },
483
483
  ],
484
484
  };
@@ -486,14 +486,14 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
486
486
  else if (name === "fetch_url_with_jina") {
487
487
  const { url } = args;
488
488
  if (!isValidUrl(url)) {
489
- throw new McpError(ErrorCode.InvalidParams, "无效的URL格式");
489
+ throw new McpError(ErrorCode.InvalidParams, "Invalid URL format");
490
490
  }
491
491
  const result = await fetchWithJinaReader(url);
492
492
  return {
493
493
  content: [
494
494
  {
495
495
  type: "text",
496
- text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: Jina Reader\n\n---\n\n${result.content}`,
496
+ text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**Fetched At**: ${result.metadata.fetchedAt}\n**Content Length**: ${result.metadata.contentLength} characters\n**Method**: Jina Reader\n\n---\n\n${result.content}`,
497
497
  },
498
498
  ],
499
499
  };
@@ -501,42 +501,42 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
501
501
  else if (name === "fetch_url_local") {
502
502
  const { url } = args;
503
503
  if (!isValidUrl(url)) {
504
- throw new McpError(ErrorCode.InvalidParams, "无效的URL格式");
504
+ throw new McpError(ErrorCode.InvalidParams, "Invalid URL format");
505
505
  }
506
506
  const result = await fetchWithLocalParser(url);
507
507
  return {
508
508
  content: [
509
509
  {
510
510
  type: "text",
511
- text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: 本地解析器\n\n---\n\n${result.content}`,
511
+ text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**Fetched At**: ${result.metadata.fetchedAt}\n**Content Length**: ${result.metadata.contentLength} characters\n**Method**: Local Parser\n\n---\n\n${result.content}`,
512
512
  },
513
513
  ],
514
514
  };
515
515
  }
516
516
  else if (name === "fetch_multiple_urls") {
517
517
  const { urls, preferJina = true } = args;
518
- // 验证所有URL
518
+ // Validate all URLs
519
519
  const invalidUrls = urls.filter(url => !isValidUrl(url));
520
520
  if (invalidUrls.length > 0) {
521
- throw new McpError(ErrorCode.InvalidParams, `以下URL格式无效: ${invalidUrls.join(", ")}`);
521
+ throw new McpError(ErrorCode.InvalidParams, `The following URLs have invalid format: ${invalidUrls.join(", ")}`);
522
522
  }
523
- // 并发获取所有URL内容
523
+ // Fetch all URLs concurrently
524
524
  const results = await Promise.allSettled(urls.map(url => fetchWebContent(url, preferJina)));
525
- // 整理结果
526
- let combinedContent = "# 批量URL内容获取结果\n\n";
525
+ // Combine results
526
+ let combinedContent = "# Batch URL Content Fetch Results\n\n";
527
527
  results.forEach((result, index) => {
528
528
  const url = urls[index];
529
529
  combinedContent += `## ${index + 1}. ${url}\n\n`;
530
530
  if (result.status === "fulfilled") {
531
531
  const { title, content, metadata } = result.value;
532
- combinedContent += `**标题**: ${title}\n`;
533
- combinedContent += `**获取时间**: ${metadata.fetchedAt}\n`;
534
- combinedContent += `**内容长度**: ${metadata.contentLength} 字符\n`;
535
- combinedContent += `**解析方法**: ${metadata.method}\n\n`;
536
- combinedContent += `### 内容\n\n${content}\n\n`;
532
+ combinedContent += `**Title**: ${title}\n`;
533
+ combinedContent += `**Fetched At**: ${metadata.fetchedAt}\n`;
534
+ combinedContent += `**Content Length**: ${metadata.contentLength} characters\n`;
535
+ combinedContent += `**Method**: ${metadata.method}\n\n`;
536
+ combinedContent += `### Content\n\n${content}\n\n`;
537
537
  }
538
538
  else {
539
- combinedContent += `**错误**: ${result.reason}\n\n`;
539
+ combinedContent += `**Error**: ${result.reason}\n\n`;
540
540
  }
541
541
  combinedContent += "---\n\n";
542
542
  });
@@ -552,47 +552,47 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
552
552
  else if (name === "fetch_url_with_browser") {
553
553
  const { url } = args;
554
554
  if (!isValidUrl(url)) {
555
- throw new McpError(ErrorCode.InvalidParams, "无效的URL格式");
555
+ throw new McpError(ErrorCode.InvalidParams, "Invalid URL format");
556
556
  }
557
557
  const result = await fetchWithPlaywright(url);
558
558
  return {
559
559
  content: [
560
560
  {
561
561
  type: "text",
562
- text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**获取时间**: ${result.metadata.fetchedAt}\n**内容长度**: ${result.metadata.contentLength} 字符\n**解析方法**: Playwright浏览器\n\n---\n\n${result.content}`,
562
+ text: `# ${result.title}\n\n**URL**: ${result.metadata.url}\n**Fetched At**: ${result.metadata.fetchedAt}\n**Content Length**: ${result.metadata.contentLength} characters\n**Method**: Playwright Browser\n\n---\n\n${result.content}`,
563
563
  },
564
564
  ],
565
565
  };
566
566
  }
567
567
  else {
568
- throw new McpError(ErrorCode.MethodNotFound, `未知的工具: ${name}`);
568
+ throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${name}`);
569
569
  }
570
570
  }
571
571
  catch (error) {
572
572
  if (error instanceof McpError) {
573
573
  throw error;
574
574
  }
575
- throw new McpError(ErrorCode.InternalError, `工具执行失败: ${error instanceof Error ? error.message : String(error)}`);
575
+ throw new McpError(ErrorCode.InternalError, `Tool execution failed: ${error instanceof Error ? error.message : String(error)}`);
576
576
  }
577
577
  });
578
- // 启动服务器
578
+ // Start server
579
579
  async function main() {
580
580
  const transport = new StdioServerTransport();
581
581
  await server.connect(transport);
582
- console.error("MCP Web Reader v2.0 已启动(支持Jina Reader + Playwright)");
582
+ console.error("MCP Web Reader v2.0 started (with Jina Reader + Playwright support)");
583
583
  }
584
- // 优雅关闭处理
584
+ // Graceful shutdown handling
585
585
  process.on('SIGINT', async () => {
586
- console.error("接收到SIGINT信号,正在关闭浏览器...");
586
+ console.error("Received SIGINT signal, closing browser...");
587
587
  await closeBrowser();
588
588
  process.exit(0);
589
589
  });
590
590
  process.on('SIGTERM', async () => {
591
- console.error("接收到SIGTERM信号,正在关闭浏览器...");
591
+ console.error("Received SIGTERM signal, closing browser...");
592
592
  await closeBrowser();
593
593
  process.exit(0);
594
594
  });
595
595
  main().catch((error) => {
596
- console.error("服务器启动失败:", error);
596
+ console.error("Server startup failed:", error);
597
597
  process.exit(1);
598
598
  });
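In the hunks shown above, `fetchWebContent` calls `shouldUseBrowser` with the error only, so `statusCode` and `content` are undefined at those call sites. If the status-code list is meant to apply there, one option is to recover the code from the error message, since both `fetchWithJinaReader` and `fetchWithLocalParser` throw messages containing `status: <code>`. The sketch below is an assumption about how that could look, not part of the package. The helper names `extractStatusCode` and `statusSuggestsBrowser` are hypothetical.

```javascript
// Sketch only: hypothetical helpers, not present in dist/index.js.

// Same status codes that shouldUseBrowser checks in the diff above.
const BROWSER_STATUS_CODES = [403, 429, 503, 520, 521, 522, 523, 524];

// Recover an HTTP status from messages such as
// "Local parser failed: HTTP error! status: 403".
function extractStatusCode(error) {
  const message = error instanceof Error ? error.message : String(error);
  const match = message.match(/status:\s*(\d{3})\b/i);
  return match ? Number(match[1]) : undefined;
}

// True when the recovered status is one that suggests retrying in a browser.
function statusSuggestsBrowser(error) {
  const statusCode = extractStatusCode(error);
  return statusCode !== undefined && BROWSER_STATUS_CODES.includes(statusCode);
}
```

A call site could then pass the recovered value as the second argument, for example `shouldUseBrowser(localErr, extractStatusCode(localErr))`. Attaching the status to the error object when it is thrown would be less fragile than parsing the message.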
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "mcp-web-reader",
3
- "version": "2.0.0",
3
+ "version": "2.1.0",
4
4
  "description": "MCP server for reading web content with Jina Reader and local parser support",
5
5
  "main": "dist/index.js",
6
6
  "bin": {
@@ -11,7 +11,8 @@
11
11
  "build": "tsc",
12
12
  "start": "node dist/index.js",
13
13
  "dev": "tsc --watch",
14
- "claude-code": "node dist/index.js"
14
+ "claude-code": "node dist/index.js",
15
+ "postinstall": "npx playwright install chromium"
15
16
  },
16
17
  "repository": {
17
18
  "type": "git",