@j0hanz/fetch-url-mcp 0.0.1 → 0.0.2
- package/README.md +129 -110
- package/dist/AGENTS.md +119 -84
- package/dist/assets/logo.svg +24836 -24836
- package/dist/index.js +0 -0
- package/dist/instructions.md +80 -27
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,38 +1,30 @@
|
|
|
1
|
-
<!-- markdownlint-disable MD033 -->
|
|
2
|
-
|
|
3
1
|
# Fetch URL MCP Server
|
|
4
2
|
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
[](https://www.npmjs.com/package/@j0hanz/fetch-url-mcp) [](https://opensource.org/licenses/MIT) [](https://nodejs.org) [](https://www.typescriptlang.org) [](https://modelcontextprotocol.io)
|
|
3
|
+
[](https://www.npmjs.com/package/@j0hanz/fetch-url-mcp) [](https://opensource.org/licenses/MIT) [](https://nodejs.org) [](https://www.typescriptlang.org) [](https://modelcontextprotocol.io)
|
|
8
4
|
|
|
9
5
|
[](https://insiders.vscode.dev/redirect?url=vscode%3Amcp%2Finstall%3F%7B%22name%22%3A%22fetch-url-mcp%22%2C%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22%40j0hanz%2Ffetch-url-mcp%40latest%22%2C%22--stdio%22%5D%7D) [](https://insiders.vscode.dev/redirect?url=vscode-insiders%3Amcp%2Finstall%3F%7B%22name%22%3A%22fetch-url-mcp%22%2C%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22%40j0hanz%2Ffetch-url-mcp%40latest%22%2C%22--stdio%22%5D%7D) [](https://cursor.com/install-mcp?name=fetch-url-mcp&config=eyJjb21tYW5kIjoibnB4IiwiYXJncyI6WyIteSIsIkBqMGhhbnovZmV0Y2gtdXJsLW1jcEBsYXRlc3QiLCItLXN0ZGlvIl19)
|
|
10
6
|
|
|
11
|
-
Fetch
|
|
7
|
+
Fetch public web pages and convert them into clean, AI-readable Markdown.
|
|
12
8
|
|
|
13
9
|
## Overview
|
|
14
10
|
|
|
15
|
-
Fetch URL is a [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server that fetches public web pages, extracts meaningful content using Mozilla's Readability algorithm, and converts the result into clean Markdown optimized for LLM context windows. It handles noise removal, caching, SSRF protection, async task execution, and supports both stdio and Streamable HTTP transports.
|
|
11
|
+
Fetch URL is a [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server that fetches public web pages, extracts meaningful content using Mozilla's Readability algorithm, and converts the result into clean Markdown optimized for LLM context windows. It handles noise removal, caching, SSRF protection, async task execution, and supports both **stdio** and **Streamable HTTP** transports.
|
|
16
12
|
|
|
17
|
-
|
|
13
|
+
> [!NOTE]
|
|
14
|
+
> Content extraction quality varies depending on the HTML structure and complexity of the source page. Fetch URL works best with standard article and documentation layouts. Pages relying on client-side JavaScript rendering may yield incomplete results.
|
|
18
15
|
|
|
19
|
-
|
|
20
|
-
- Raw content URL rewriting for GitHub, GitLab, Bitbucket, and Gist.
|
|
21
|
-
- In-memory LRU cache for faster repeat fetches.
|
|
22
|
-
- Stdio or Streamable HTTP transport with session management.
|
|
23
|
-
- SSRF protections: blocked private IP ranges and internal hostnames.
|
|
16
|
+
## Key Features
|
|
24
17
|
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
> approaches.
|
|
18
|
+
- **HTML to Markdown** — Content extraction via Mozilla Readability + node-html-markdown
|
|
19
|
+
- **Noise removal** — Strips navigation, ads, cookie banners, and other non-content elements
|
|
20
|
+
- **In-memory LRU cache** — Faster repeat fetches with configurable TTL (24 h default)
|
|
21
|
+
- **Raw URL rewriting** — Auto-converts GitHub, GitLab, Bitbucket, and Gist URLs to raw content endpoints
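The raw URL rewriting feature can be illustrated for the GitHub case. This is a minimal sketch, not the package's implementation: `toRawGitHubUrl` and the regular expression are hypothetical names, and it assumes the common `github.com/{owner}/{repo}/blob/{ref}/{path}` page shape mapping to `raw.githubusercontent.com`. GitLab, Bitbucket, and Gist would each need their own rule.

```typescript
// Hypothetical helper (assumption, not taken from the package source).
// Matches https://github.com/{owner}/{repo}/blob/{ref}/{path} and ignores
// any trailing query string or fragment.
const GITHUB_BLOB_URL =
  /^https:\/\/github\.com\/([^/]+)\/([^/]+)\/blob\/([^/]+)\/([^?#]+)(?:[?#].*)?$/;

function toRawGitHubUrl(input: string): string {
  const match = GITHUB_BLOB_URL.exec(input);
  if (match === null) return input; // not a GitHub blob page: leave untouched

  const [, owner, repo, ref, path] = match;
  return `https://raw.githubusercontent.com/${owner}/${repo}/${ref}/${path}`;
}
```

The sketch treats the first path segment after `blob/` as the ref, so branch names that contain a slash would be split incorrectly; a real implementation would have to resolve that ambiguity some other way.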
|
|
30
22
|
|
|
31
23
|
## Tech Stack
|
|
32
24
|
|
|
33
25
|
| Component | Technology |
|
|
34
26
|
| ------------------- | ----------------------------------- |
|
|
35
|
-
| Runtime | Node.js
|
|
27
|
+
| Runtime | Node.js >= 24 |
|
|
36
28
|
| Language | TypeScript 5.9 |
|
|
37
29
|
| MCP SDK | `@modelcontextprotocol/sdk` ^1.26.0 |
|
|
38
30
|
| Content Extraction | `@mozilla/readability` ^0.6.0 |
|
|
@@ -50,51 +42,31 @@ URL → Validate → DNS Preflight → HTTP Fetch → Decompress
|
|
|
50
42
|
```
|
|
51
43
|
|
|
52
44
|
1. **URL Validation** — Normalize, block private hosts, transform raw-content URLs (GitHub, GitLab, Bitbucket)
|
|
53
|
-
2. **Fetch** — HTTP request
|
|
45
|
+
2. **Fetch** — HTTP request with redirect following, DNS preflight SSRF checks, and size limits (10 MB)
|
|
54
46
|
3. **Transform** — Offloaded to worker threads: parse HTML with `linkedom`, extract with Readability, remove DOM noise, convert to Markdown
|
|
55
|
-
4. **Cleanup** — Multi-pass Markdown normalization (heading promotion, spacing, skip-link removal
|
|
56
|
-
5. **Cache + Respond** — Store result, apply inline content limits, return structured content
|
|
47
|
+
4. **Cleanup** — Multi-pass Markdown normalization (heading promotion, spacing, skip-link removal)
|
|
48
|
+
5. **Cache + Respond** — Store result in LRU cache, apply inline content limits, return structured content
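Step 5 stores results in an LRU cache with a TTL. The sketch below shows one common way to build such a cache on top of `Map` insertion order. It is an illustration under stated assumptions: the class name `LruTtlCache`, its constructor arguments, and the explicit `now` parameter (added so the behaviour is easy to test) are hypothetical and do not come from the package.

```typescript
// Minimal LRU cache with a fixed TTL per entry (illustrative sketch only).
class LruTtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;

  constructor(maxEntries: number, ttlMs: number) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

With the 24 h default mentioned in the feature list, a cache of this shape would be created as `new LruTtlCache<string>(maxEntries, 24 * 60 * 60 * 1000)`; the value of `maxEntries` is not stated in the README.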
|
|
57
49
|
|
|
58
50
|
## Repository Structure
|
|
59
51
|
|
|
60
52
|
```text
|
|
61
53
|
fetch-url-mcp/
|
|
62
|
-
├── assets/
|
|
63
|
-
|
|
64
|
-
├── scripts/
|
|
65
|
-
│ ├── tasks.mjs
|
|
66
|
-
│ └── validate-fetch.mjs
|
|
54
|
+
├── assets/ # Server icon (logo.svg)
|
|
55
|
+
├── scripts/ # Build & test orchestration
|
|
67
56
|
├── src/
|
|
68
|
-
│ ├── workers/
|
|
69
|
-
│
|
|
70
|
-
│
|
|
71
|
-
│ ├──
|
|
72
|
-
│ ├──
|
|
73
|
-
│ ├──
|
|
74
|
-
│ ├──
|
|
75
|
-
│ ├──
|
|
76
|
-
│ ├──
|
|
77
|
-
│ ├──
|
|
78
|
-
│ ├── http-native.ts
|
|
79
|
-
│
|
|
80
|
-
|
|
81
|
-
│ ├── ip-blocklist.ts
|
|
82
|
-
│ ├── json.ts
|
|
83
|
-
│ ├── language-detection.ts
|
|
84
|
-
│ ├── markdown-cleanup.ts
|
|
85
|
-
│ ├── mcp-validator.ts
|
|
86
|
-
│ ├── mcp.ts
|
|
87
|
-
│ ├── observability.ts
|
|
88
|
-
│ ├── server-tuning.ts
|
|
89
|
-
│ ├── session.ts
|
|
90
|
-
│ ├── tasks.ts
|
|
91
|
-
│ ├── timer-utils.ts
|
|
92
|
-
│ ├── tools.ts
|
|
93
|
-
│ ├── transform-types.ts
|
|
94
|
-
│ ├── transform.ts
|
|
95
|
-
│ └── type-guards.ts
|
|
96
|
-
├── tests/
|
|
97
|
-
│ └── *.test.ts
|
|
57
|
+
│ ├── workers/ # Worker-thread child for HTML transforms
|
|
58
|
+
│ ├── index.ts # CLI entrypoint, transport wiring, shutdown
|
|
59
|
+
│ ├── server.ts # McpServer lifecycle and registration
|
|
60
|
+
│ ├── tools.ts # fetch-url tool definition and pipeline
|
|
61
|
+
│ ├── fetch.ts # URL normalization, SSRF, HTTP fetch
|
|
62
|
+
│ ├── transform.ts # HTML-to-Markdown pipeline, worker pool
|
|
63
|
+
│ ├── config.ts # Env-driven configuration
|
|
64
|
+
│ ├── resources.ts # MCP resource/template registration
|
|
65
|
+
│ ├── prompts.ts # MCP prompt registration (get-help)
|
|
66
|
+
│ ├── mcp.ts # Task execution management
|
|
67
|
+
│ ├── http-native.ts # Streamable HTTP server, auth, sessions
|
|
68
|
+
│ └── instructions.md # Server instructions embedded at runtime
|
|
69
|
+
├── tests/ # Unit/integration tests (Node.js test runner)
|
|
98
70
|
├── package.json
|
|
99
71
|
├── tsconfig.json
|
|
100
72
|
└── AGENTS.md
|
|
@@ -102,7 +74,7 @@ fetch-url-mcp/
|
|
|
102
74
|
|
|
103
75
|
## Requirements
|
|
104
76
|
|
|
105
|
-
- **Node.js**
|
|
77
|
+
- **Node.js** >= 24
|
|
106
78
|
|
|
107
79
|
## Quickstart
|
|
108
80
|
|
|
@@ -150,6 +122,12 @@ npm run build
|
|
|
150
122
|
node dist/index.js --stdio
|
|
151
123
|
```
|
|
152
124
|
|
|
125
|
+
### Docker
|
|
126
|
+
|
|
127
|
+
```bash
|
|
128
|
+
docker compose up --build
|
|
129
|
+
```
|
|
130
|
+
|
|
153
131
|
## Configuration
|
|
154
132
|
|
|
155
133
|
### Runtime Modes
|
|
@@ -166,18 +144,23 @@ When no `--stdio` flag is passed, the server starts in **HTTP mode** (Streamable
|
|
|
166
144
|
|
|
167
145
|
#### Core Settings
|
|
168
146
|
|
|
169
|
-
| Variable
|
|
170
|
-
|
|
|
171
|
-
| `HOST`
|
|
172
|
-
| `PORT`
|
|
173
|
-
| `LOG_LEVEL`
|
|
174
|
-
| `FETCH_TIMEOUT_MS`
|
|
175
|
-
| `CACHE_ENABLED`
|
|
176
|
-
| `USER_AGENT`
|
|
177
|
-
| `ALLOW_REMOTE`
|
|
178
|
-
| `ALLOWED_HOSTS`
|
|
179
|
-
|
|
180
|
-
|
|
147
|
+
| Variable | Default | Description |
|
|
148
|
+
| ------------------ | ------------------------- | --------------------------------------------------- |
|
|
149
|
+
| `HOST` | `127.0.0.1` | HTTP server bind address |
|
|
150
|
+
| `PORT` | `3000` | HTTP server port (1024–65535) |
|
|
151
|
+
| `LOG_LEVEL` | `info` | Log level: `debug`, `info`, `warn`, `error` |
|
|
152
|
+
| `FETCH_TIMEOUT_MS` | `15000` | HTTP fetch timeout in ms (1000–60000) |
|
|
153
|
+
| `CACHE_ENABLED` | `true` | Enable/disable in-memory content cache |
|
|
154
|
+
| `USER_AGENT` | `fetch-url-mcp/{version}` | Custom User-Agent header |
|
|
155
|
+
| `ALLOW_REMOTE` | `false` | Allow remote connections in HTTP mode |
|
|
156
|
+
| `ALLOWED_HOSTS` | _(empty)_ | Comma-separated host/origin allowlist for HTTP mode |
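The defaults and ranges in the Core Settings table can be read as a small parsing routine. The following is a sketch only: `intFromEnv` and `loadCoreSettings` are hypothetical names, the environment is passed in as a plain object to keep the example self-contained, and falling back to the default on out-of-range input is an assumption (the README does not say whether the server rejects, clamps, or ignores such values).

```typescript
type Env = Record<string, string | undefined>;

// Parse a whole-number variable; use the fallback when it is unset,
// not a whole number, or outside [min, max] (assumed behaviour).
function intFromEnv(env: Env, name: string, fallback: number, min: number, max: number): number {
  const raw = (env[name] ?? "").trim();
  if (!/^\d+$/.test(raw)) return fallback;
  const value = Number(raw);
  return value >= min && value <= max ? value : fallback;
}

// Defaults and ranges are copied from the table above.
function loadCoreSettings(env: Env) {
  return {
    host: env["HOST"] ?? "127.0.0.1",
    port: intFromEnv(env, "PORT", 3000, 1024, 65535),
    fetchTimeoutMs: intFromEnv(env, "FETCH_TIMEOUT_MS", 15000, 1000, 60000),
    cacheEnabled: (env["CACHE_ENABLED"] ?? "true") !== "false",
    allowRemote: env["ALLOW_REMOTE"] === "true",
    allowedHosts: (env["ALLOWED_HOSTS"] ?? "")
      .split(",")
      .map((host) => host.trim())
      .filter((host) => host.length > 0),
  };
}
```

In a Node.js process the function would typically be called as `loadCoreSettings(process.env)`.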
|
|
157
|
+
|
|
158
|
+
#### Task Management
|
|
159
|
+
|
|
160
|
+
| Variable | Default | Description |
|
|
161
|
+
| --------------------- | ------- | ------------------------------------------------ |
|
|
162
|
+
| `TASKS_MAX_TOTAL` | `5000` | Maximum retained task records across all owners |
|
|
163
|
+
| `TASKS_MAX_PER_OWNER` | `1000` | Maximum retained task records per session/client |
|
|
181
164
|
|
|
182
165
|
#### Authentication (HTTP Mode)
|
|
183
166
|
|
|
@@ -282,6 +265,7 @@ Fetches a webpage and converts it to clean Markdown format optimized for LLM con
|
|
|
282
265
|
**Limitations:**
|
|
283
266
|
|
|
284
267
|
- Does not execute complex client-side JavaScript interactions
|
|
268
|
+
- Inline output may be truncated when `MAX_INLINE_CONTENT_CHARS` is set
|
|
285
269
|
|
|
286
270
|
##### Parameters
|
|
287
271
|
|
|
@@ -297,31 +281,43 @@ Fetches a webpage and converts it to clean Markdown format optimized for LLM con
|
|
|
297
281
|
```json
|
|
298
282
|
{
|
|
299
283
|
"url": "https://example.com",
|
|
284
|
+
"inputUrl": "https://example.com",
|
|
300
285
|
"resolvedUrl": "https://example.com",
|
|
301
286
|
"finalUrl": "https://example.com",
|
|
302
|
-
"inputUrl": "https://example.com",
|
|
303
287
|
"title": "Example Domain",
|
|
288
|
+
"metadata": {
|
|
289
|
+
"title": "Example Domain",
|
|
290
|
+
"description": "...",
|
|
291
|
+
"author": "...",
|
|
292
|
+
"image": "...",
|
|
293
|
+
"favicon": "...",
|
|
294
|
+
"publishedAt": "...",
|
|
295
|
+
"modifiedAt": "..."
|
|
296
|
+
},
|
|
304
297
|
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
|
|
298
|
+
"fromCache": false,
|
|
299
|
+
"fetchedAt": "2026-02-11T12:00:00.000Z",
|
|
300
|
+
"contentSize": 1234,
|
|
305
301
|
"truncated": false
|
|
306
302
|
}
|
|
307
303
|
```
|
|
308
304
|
|
|
309
|
-
| Field | Type | Description
|
|
310
|
-
| ------------- | ---------- |
|
|
311
|
-
| `url` | `string` | The canonical URL (pre-raw-transform)
|
|
312
|
-
| `inputUrl` | `string
|
|
313
|
-
| `resolvedUrl` | `string
|
|
314
|
-
| `finalUrl` | `string?` | Final response URL after redirects
|
|
315
|
-
| `title` | `string?` | Extracted page title
|
|
316
|
-
| `metadata` | `object?` | Extracted metadata (title, description, author
|
|
317
|
-
| `markdown` | `string?` | Extracted content in Markdown format
|
|
318
|
-
| `fromCache` | `boolean?` | Whether the response was served from cache
|
|
319
|
-
| `fetchedAt` | `string?` | ISO timestamp for fetch/cache retrieval
|
|
320
|
-
| `contentSize` | `number?` | Full markdown size before inline truncation
|
|
321
|
-
| `truncated` | `boolean?` | Whether inline markdown was truncated
|
|
322
|
-
| `error` | `string?` | Error message if the request failed
|
|
323
|
-
| `statusCode` | `number?` | HTTP status code for failed requests
|
|
324
|
-
| `details` | `object?` | Additional error details
|
|
305
|
+
| Field | Type | Description |
|
|
306
|
+
| ------------- | ---------- | ---------------------------------------------------------------------------------------- |
|
|
307
|
+
| `url` | `string` | The canonical URL (pre-raw-transform) |
|
|
308
|
+
| `inputUrl` | `string?` | The original URL provided by the caller |
|
|
309
|
+
| `resolvedUrl` | `string?` | The normalized/transformed URL that was fetched |
|
|
310
|
+
| `finalUrl` | `string?` | Final response URL after redirects |
|
|
311
|
+
| `title` | `string?` | Extracted page title |
|
|
312
|
+
| `metadata` | `object?` | Extracted metadata (title, description, author, image, favicon, publishedAt, modifiedAt) |
|
|
313
|
+
| `markdown` | `string?` | Extracted content in Markdown format |
|
|
314
|
+
| `fromCache` | `boolean?` | Whether the response was served from cache |
|
|
315
|
+
| `fetchedAt` | `string?` | ISO timestamp for fetch/cache retrieval |
|
|
316
|
+
| `contentSize` | `number?` | Full markdown size before inline truncation |
|
|
317
|
+
| `truncated` | `boolean?` | Whether inline markdown was truncated |
|
|
318
|
+
| `error` | `string?` | Error message if the request failed |
|
|
319
|
+
| `statusCode` | `number?` | HTTP status code for failed requests |
|
|
320
|
+
| `details` | `object?` | Additional error details |
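For callers written in TypeScript, the field table above can be transcribed into a type. This is a hand-written sketch, not a type exported by the package: the interface names and the `isFailedResult` helper are hypothetical, and the metadata sub-fields are assumed to be optional strings because the example response shows them only as `"..."`.

```typescript
// Transcription of the structured-output table (assumed shape, not an official type).
interface FetchUrlMetadata {
  title?: string;
  description?: string;
  author?: string;
  image?: string;
  favicon?: string;
  publishedAt?: string;
  modifiedAt?: string;
}

interface FetchUrlResult {
  url: string;
  inputUrl?: string;
  resolvedUrl?: string;
  finalUrl?: string;
  title?: string;
  metadata?: FetchUrlMetadata;
  markdown?: string;
  fromCache?: boolean;
  fetchedAt?: string;
  contentSize?: number;
  truncated?: boolean;
  error?: string;
  statusCode?: number;
  details?: Record<string, unknown>;
}

// Hypothetical helper: treats a result as failed when `error` is present.
function isFailedResult(result: FetchUrlResult): boolean {
  return typeof result.error === "string";
}
```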
|
|
325
321
|
|
|
326
322
|
##### Annotations
|
|
327
323
|
|
|
@@ -334,7 +330,7 @@ Fetches a webpage and converts it to clean Markdown format optimized for LLM con
|
|
|
334
330
|
|
|
335
331
|
##### Async Task Execution
|
|
336
332
|
|
|
337
|
-
The `fetch-url` tool supports optional async task execution. Include a `task` field in the tool call to run the fetch in the background:
|
|
333
|
+
The `fetch-url` tool supports optional async task execution (`execution.taskSupport: "optional"`). Include a `task` field in the tool call to run the fetch in the background:
|
|
338
334
|
|
|
339
335
|
```json
|
|
340
336
|
{
|
|
@@ -351,9 +347,9 @@ Then poll `tasks/get` until the task status is `completed` or `failed`, and retr
|
|
|
351
347
|
|
|
352
348
|
### Prompts
|
|
353
349
|
|
|
354
|
-
| Name | Description
|
|
355
|
-
| ---------- |
|
|
356
|
-
| `get-help` | Returns server usage
|
|
350
|
+
| Name | Description |
|
|
351
|
+
| ---------- | --------------------------------- |
|
|
352
|
+
| `get-help` | Returns server usage instructions |
|
|
357
353
|
|
|
358
354
|
### Resources
|
|
359
355
|
|
|
@@ -362,12 +358,6 @@ Then poll `tasks/get` until the task status is `completed` or `failed`, and retr
|
|
|
362
358
|
| `internal://instructions` | `text/markdown` | Server instructions and usage guidance |
|
|
363
359
|
| `internal://cache/{namespace}/{hash}` | `text/markdown` | Cached markdown entries from prior `fetch-url` calls |
|
|
364
360
|
|
|
365
|
-
### Completions
|
|
366
|
-
|
|
367
|
-
- `completion/complete` supports `internal://cache/{namespace}/{hash}` template variables:
|
|
368
|
-
- `namespace`
|
|
369
|
-
- `hash` (optionally filtered by `context.arguments.namespace`)
|
|
370
|
-
|
|
371
361
|
### Tasks
|
|
372
362
|
|
|
373
363
|
The server declares full MCP task support:
|
|
@@ -479,6 +469,37 @@ Add to your Windsurf MCP configuration:
|
|
|
479
469
|
|
|
480
470
|
</details>
|
|
481
471
|
|
|
472
|
+
<details>
|
|
473
|
+
<summary>Docker</summary>
|
|
474
|
+
|
|
475
|
+
Use the published image from GitHub Container Registry:
|
|
476
|
+
|
|
477
|
+
```json
|
|
478
|
+
{
|
|
479
|
+
"mcpServers": {
|
|
480
|
+
"fetch-url-mcp": {
|
|
481
|
+
"command": "docker",
|
|
482
|
+
"args": [
|
|
483
|
+
"run",
|
|
484
|
+
"-i",
|
|
485
|
+
"--rm",
|
|
486
|
+
"ghcr.io/j0hanz/fetch-url-mcp:latest",
|
|
487
|
+
"--stdio"
|
|
488
|
+
]
|
|
489
|
+
}
|
|
490
|
+
}
|
|
491
|
+
}
|
|
492
|
+
```
|
|
493
|
+
|
|
494
|
+
Or build and run locally:
|
|
495
|
+
|
|
496
|
+
```bash
|
|
497
|
+
docker build -t fetch-url-mcp .
|
|
498
|
+
docker run -i --rm fetch-url-mcp --stdio
|
|
499
|
+
```
|
|
500
|
+
|
|
501
|
+
</details>
|
|
502
|
+
|
|
482
503
|
## Security
|
|
483
504
|
|
|
484
505
|
### SSRF Protection
|
|
@@ -486,7 +507,7 @@ Add to your Windsurf MCP configuration:
|
|
|
486
507
|
Fetch URL blocks requests to private and internal network addresses:
|
|
487
508
|
|
|
488
509
|
- **Blocked hosts**: `localhost`, `127.0.0.0/8`, `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `169.254.0.0/16`, `100.64.0.0/10`
|
|
489
|
-
- **Blocked IPv6**: `::1`, `fc00::/7`, `fe80::/10`, IPv4-mapped private addresses
|
|
510
|
+
- **Blocked IPv6**: `::1`, `fc00::/7`, `fe80::/10`, IPv4-mapped private addresses
|
|
490
511
|
- **Cloud metadata**: `169.254.169.254` (AWS), `metadata.google.internal`, `metadata.azure.com`, `100.100.100.200` (Azure IMDS)
|
|
491
512
|
|
|
492
513
|
DNS preflight checks run on every redirect hop to prevent DNS rebinding attacks.
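The IPv4 part of the blocklist above amounts to a CIDR range test applied to each address a hostname resolves to. The sketch below shows that test in isolation. It is not the package's code: `ipv4ToInt` and `isBlockedIpv4` are hypothetical names, only the IPv4 ranges from the "Blocked hosts" bullet are covered, and treating unparseable input as blocked is an assumption. IPv6 ranges and metadata hostnames would need separate checks.

```typescript
// IPv4 CIDR ranges copied from the "Blocked hosts" bullet above.
const BLOCKED_IPV4_CIDRS = [
  "127.0.0.0/8",
  "10.0.0.0/8",
  "172.16.0.0/12",
  "192.168.0.0/16",
  "169.254.0.0/16",
  "100.64.0.0/10",
];

// Convert a dotted-quad IPv4 string to an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

function isBlockedIpv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true; // fail closed on anything unparseable (assumption)
  return BLOCKED_IPV4_CIDRS.some((cidr) => {
    const [base = "", bits = "32"] = cidr.split("/");
    const baseInt = ipv4ToInt(base);
    const size = 2 ** (32 - Number(bits)); // number of addresses in the range
    return baseInt !== null && addr >= baseInt && addr < baseInt + size;
  });
}
```

To match the behaviour described above, a check like this would run against the resolved addresses of the initial URL and again for every redirect target, not only once per request.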
|
|
@@ -532,12 +553,12 @@ npm install
|
|
|
532
553
|
## Build and Release
|
|
533
554
|
|
|
534
555
|
```bash
|
|
535
|
-
npm run build
|
|
556
|
+
npm run build # Clean → Compile → Copy Assets → chmod
|
|
536
557
|
npm run prepublishOnly # Lint → Type-Check → Build
|
|
537
|
-
npm publish
|
|
558
|
+
npm publish # Publish to npm
|
|
538
559
|
```
|
|
539
560
|
|
|
540
|
-
|
|
561
|
+
CI/CD is handled via a GitHub Actions workflow (`release.yml`) that runs lint, type-check, test, build, and publishes to npm with version bumping.
|
|
541
562
|
|
|
542
563
|
## Troubleshooting
|
|
543
564
|
|
|
@@ -549,17 +570,15 @@ Use the built-in inspector to test the server interactively:
|
|
|
549
570
|
npm run inspector
|
|
550
571
|
```
|
|
551
572
|
|
|
552
|
-
This builds the project and launches `@modelcontextprotocol/inspector` pointing to the compiled server.
|
|
553
|
-
|
|
554
573
|
### Common Issues
|
|
555
574
|
|
|
556
|
-
| Issue | Solution
|
|
557
|
-
| ------------------------- |
|
|
558
|
-
| `VALIDATION_ERROR` on URL | URL is blocked (private IP/localhost) or malformed. Do not retry.
|
|
559
|
-
| `queue_full` error |
|
|
560
|
-
| Garbled output | Binary content (images, PDFs) cannot be converted. Ensure the URL serves HTML.
|
|
561
|
-
| No output in stdio mode | Ensure `--stdio` flag is passed. Without it, the server starts in HTTP mode.
|
|
562
|
-
| Auth errors in HTTP mode | Set `ACCESS_TOKENS` or `API_KEY` env var and pass as `Authorization: Bearer <token>`.
|
|
575
|
+
| Issue | Solution |
|
|
576
|
+
| ------------------------- | ------------------------------------------------------------------------------------- |
|
|
577
|
+
| `VALIDATION_ERROR` on URL | URL is blocked (private IP/localhost) or malformed. Do not retry. |
|
|
578
|
+
| `queue_full` error | Worker pool busy. Wait briefly, then retry or use async task mode. |
|
|
579
|
+
| Garbled output | Binary content (images, PDFs) cannot be converted. Ensure the URL serves HTML. |
|
|
580
|
+
| No output in stdio mode | Ensure `--stdio` flag is passed. Without it, the server starts in HTTP mode. |
|
|
581
|
+
| Auth errors in HTTP mode | Set `ACCESS_TOKENS` or `API_KEY` env var and pass as `Authorization: Bearer <token>`. |
|
|
563
582
|
|
|
564
583
|
### Stdout / Stderr Guidance
|
|
565
584
|
|