@j0hanz/fetch-url-mcp 0.0.1 → 0.0.2

package/README.md CHANGED
@@ -1,38 +1,30 @@
1
- <!-- markdownlint-disable MD033 -->
2
-
3
1
  # Fetch URL MCP Server
4
2
 
5
- <img src="assets/logo.svg" alt="Fetch URL MCP Logo" width="300">
6
-
7
- [![npm version](https://img.shields.io/npm/v/%40j0hanz%2Ffetch-url-mcp)](https://www.npmjs.com/package/@j0hanz/fetch-url-mcp) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) [![Node.js](https://img.shields.io/badge/node-%3E%3D24-brightgreen)](https://nodejs.org) [![TypeScript](https://img.shields.io/badge/TypeScript-5.9-blue)](https://www.typescriptlang.org) [![MCP SDK](https://img.shields.io/badge/MCP%20SDK-1.26-purple)](https://modelcontextprotocol.io)
3
+ [![npm version](https://img.shields.io/npm/v/%40j0hanz%2Ffetch-url-mcp)](https://www.npmjs.com/package/@j0hanz/fetch-url-mcp) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Node.js](https://img.shields.io/badge/node-%3E%3D24-3c873a)](https://nodejs.org) [![TypeScript](https://img.shields.io/badge/TypeScript-5.9-3178c6?logo=typescript&logoColor=white)](https://www.typescriptlang.org) [![MCP SDK](https://img.shields.io/badge/MCP%20SDK-1.26-7c3aed)](https://modelcontextprotocol.io)
8
4
 
9
5
  [![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0078d7?logo=visual-studio-code&logoColor=white)](https://insiders.vscode.dev/redirect?url=vscode%3Amcp%2Finstall%3F%7B%22name%22%3A%22fetch-url-mcp%22%2C%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22%40j0hanz%2Ffetch-url-mcp%40latest%22%2C%22--stdio%22%5D%7D) [![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?logo=visual-studio-code&logoColor=white)](https://insiders.vscode.dev/redirect?url=vscode-insiders%3Amcp%2Finstall%3F%7B%22name%22%3A%22fetch-url-mcp%22%2C%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22%40j0hanz%2Ffetch-url-mcp%40latest%22%2C%22--stdio%22%5D%7D) [![Install in Cursor](https://img.shields.io/badge/Cursor-Install-f97316?logo=cursor&logoColor=white)](https://cursor.com/install-mcp?name=fetch-url-mcp&config=eyJjb21tYW5kIjoibnB4IiwiYXJncyI6WyIteSIsIkBqMGhhbnovZmV0Y2gtdXJsLW1jcEBsYXRlc3QiLCItLXN0ZGlvIl19)
10
6
 
11
- Fetch and convert public web content to clean Markdown both readable by humans and optimized for LLM context.
7
+ Fetch public web pages and convert them into clean, AI-readable Markdown.
12
8
 
13
9
  ## Overview
14
10
 
15
- Fetch URL is a [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server that fetches public web pages, extracts meaningful content using Mozilla's Readability algorithm, and converts the result into clean Markdown optimized for LLM context windows. It handles noise removal, caching, SSRF protection, async task execution, and supports both stdio and Streamable HTTP transports.
11
+ Fetch URL is a [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server that fetches public web pages, extracts meaningful content using Mozilla's Readability algorithm, and converts the result into clean Markdown optimized for LLM context windows. It handles noise removal, caching, SSRF protection, and async task execution, and supports both **stdio** and **Streamable HTTP** transports.
16
12
 
17
- ## Key Features
13
+ > [!NOTE]
14
+ > Content extraction quality varies depending on the HTML structure and complexity of the source page. Fetch URL works best with standard article and documentation layouts. Pages relying on client-side JavaScript rendering may yield incomplete results.
18
15
 
19
- - HTML to Markdown using Mozilla Readability + node-html-markdown.
20
- - Raw content URL rewriting for GitHub, GitLab, Bitbucket, and Gist.
21
- - In-memory LRU cache for faster repeat fetches.
22
- - Stdio or Streamable HTTP transport with session management.
23
- - SSRF protections: blocked private IP ranges and internal hostnames.
16
+ ## Key Features
24
17
 
25
- > **Note:** Content extraction quality varies depending on the HTML structure and
26
- > complexity of the source page. Fetch URL works best with standard article and
27
- > documentation layouts. Always verify the fetched content to ensure it meets
28
- > your expectations, as some pages may require manual adjustment or alternative
29
- > approaches.
18
+ - **HTML to Markdown**: Content extraction via Mozilla Readability + node-html-markdown
19
+ - **Noise removal**: Strips navigation, ads, cookie banners, and other non-content elements
20
+ - **In-memory LRU cache**: Faster repeat fetches with configurable TTL (24 h default)
21
+ - **Raw URL rewriting**: Auto-converts GitHub, GitLab, Bitbucket, and Gist URLs to raw content endpoints
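To illustrate the raw URL rewriting feature listed above, here is a minimal TypeScript sketch that covers only the GitHub `blob` case. The function name `toRawGitHubUrl` and the regular expression are assumptions made for illustration; the package's actual rewriting logic is not shown in this diff and may handle more cases.

```typescript
// Hypothetical helper (not from the package): rewrites a GitHub "blob" page
// URL to its raw.githubusercontent.com equivalent. URLs that do not match
// the pattern are returned unchanged.
// Limitation: refs that contain "/" (e.g. "feature/x") are not handled.
function toRawGitHubUrl(input: string): string {
  const match =
    /^https:\/\/github\.com\/([^/]+)\/([^/]+)\/blob\/([^/]+)\/(.+)$/.exec(input);
  if (match === null) return input;
  const owner = match[1];
  const repo = match[2];
  const ref = match[3];
  const path = match[4];
  return `https://raw.githubusercontent.com/${owner}/${repo}/${ref}/${path}`;
}
```

A URL such as `https://github.com/owner/repo/blob/main/docs/a.md` would be rewritten to its `raw.githubusercontent.com` form, while any non-matching URL passes through untouched.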
30
22
 
31
23
  ## Tech Stack
32
24
 
33
25
  | Component | Technology |
34
26
  | ------------------- | ----------------------------------- |
35
- | Runtime | Node.js 24 |
27
+ | Runtime | Node.js >= 24 |
36
28
  | Language | TypeScript 5.9 |
37
29
  | MCP SDK | `@modelcontextprotocol/sdk` ^1.26.0 |
38
30
  | Content Extraction | `@mozilla/readability` ^0.6.0 |
@@ -50,51 +42,31 @@ URL → Validate → DNS Preflight → HTTP Fetch → Decompress
50
42
  ```
51
43
 
52
44
  1. **URL Validation** — Normalize, block private hosts, transform raw-content URLs (GitHub, GitLab, Bitbucket)
53
- 2. **Fetch** — HTTP request via `undici` with redirect following, DNS preflight SSRF checks, and size limits
45
+ 2. **Fetch** — HTTP request with redirect following, DNS preflight SSRF checks, and size limits (10 MB)
54
46
  3. **Transform** — Offloaded to worker threads: parse HTML with `linkedom`, extract with Readability, remove DOM noise, convert to Markdown
55
- 4. **Cleanup** — Multi-pass Markdown normalization (heading promotion, spacing, skip-link removal, TypeDoc comment stripping)
56
- 5. **Cache + Respond** — Store result, apply inline content limits, return structured content
47
+ 4. **Cleanup** — Multi-pass Markdown normalization (heading promotion, spacing, skip-link removal)
48
+ 5. **Cache + Respond** — Store result in LRU cache, apply inline content limits, return structured content
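The caching in step 5 can be pictured with a small LRU cache that also expires entries after a TTL. This is a sketch under stated assumptions, not the package's cache: the class name `TtlLruCache`, the entry limit, and the injectable clock are all hypothetical.

```typescript
// Minimal LRU cache with a per-entry TTL (illustrative only).
// A Map iterates in insertion order, so its first key is always the
// least recently used entry.
class TtlLruCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so that expiry can be tested deterministically.
  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // entry has expired
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (the first key in the Map).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

With the documented 24 h default, the TTL argument would be `24 * 60 * 60 * 1000`; the maximum entry count is not stated in the README, so any value passed here is a guess.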
57
49
 
58
50
  ## Repository Structure
59
51
 
60
52
  ```text
61
53
  fetch-url-mcp/
62
- ├── assets/
63
- │ └── logo.svg
64
- ├── scripts/
65
- │ ├── tasks.mjs
66
- │ └── validate-fetch.mjs
54
+ ├── assets/ # Server icon (logo.svg)
55
+ ├── scripts/ # Build & test orchestration
67
56
  ├── src/
68
- │ ├── workers/
69
- │ │ ├── transform-child.ts
70
- │ │ └── transform-worker.ts
71
- │ ├── cache.ts
72
- │ ├── config.ts
73
- │ ├── crypto.ts
74
- │ ├── dom-noise-removal.ts
75
- │ ├── errors.ts
76
- │ ├── fetch.ts
77
- │ ├── host-normalization.ts
78
- │ ├── http-native.ts
79
- │ ├── index.ts
80
- │ ├── instructions.md
81
- │ ├── ip-blocklist.ts
82
- │ ├── json.ts
83
- │ ├── language-detection.ts
84
- │ ├── markdown-cleanup.ts
85
- │ ├── mcp-validator.ts
86
- │ ├── mcp.ts
87
- │ ├── observability.ts
88
- │ ├── server-tuning.ts
89
- │ ├── session.ts
90
- │ ├── tasks.ts
91
- │ ├── timer-utils.ts
92
- │ ├── tools.ts
93
- │ ├── transform-types.ts
94
- │ ├── transform.ts
95
- │ └── type-guards.ts
96
- ├── tests/
97
- │ └── *.test.ts
57
+ │ ├── workers/ # Worker-thread child for HTML transforms
58
+ │ ├── index.ts # CLI entrypoint, transport wiring, shutdown
59
+ │ ├── server.ts # McpServer lifecycle and registration
60
+ │ ├── tools.ts # fetch-url tool definition and pipeline
61
+ │ ├── fetch.ts # URL normalization, SSRF, HTTP fetch
62
+ │ ├── transform.ts # HTML-to-Markdown pipeline, worker pool
63
+ │ ├── config.ts # Env-driven configuration
64
+ │ ├── resources.ts # MCP resource/template registration
65
+ │ ├── prompts.ts # MCP prompt registration (get-help)
66
+ │ ├── mcp.ts # Task execution management
67
+ │ ├── http-native.ts # Streamable HTTP server, auth, sessions
68
+ │ └── instructions.md # Server instructions embedded at runtime
69
+ ├── tests/ # Unit/integration tests (Node.js test runner)
98
70
  ├── package.json
99
71
  ├── tsconfig.json
100
72
  └── AGENTS.md
@@ -102,7 +74,7 @@ fetch-url-mcp/
102
74
 
103
75
  ## Requirements
104
76
 
105
- - **Node.js** 24
77
+ - **Node.js** >= 24
106
78
 
107
79
  ## Quickstart
108
80
 
@@ -150,6 +122,12 @@ npm run build
150
122
  node dist/index.js --stdio
151
123
  ```
152
124
 
125
+ ### Docker
126
+
127
+ ```bash
128
+ docker compose up --build
129
+ ```
130
+
153
131
  ## Configuration
154
132
 
155
133
  ### Runtime Modes
@@ -166,18 +144,23 @@ When no `--stdio` flag is passed, the server starts in **HTTP mode** (Streamable
166
144
 
167
145
  #### Core Settings
168
146
 
169
- | Variable | Default | Description |
170
- | --------------------- | ------------------------- | ------------------------------------------------------ |
171
- | `HOST` | `127.0.0.1` | HTTP server bind address |
172
- | `PORT` | `3000` | HTTP server port (1024–65535) |
173
- | `LOG_LEVEL` | `info` | Log level: `debug`, `info`, `warn`, `error` |
174
- | `FETCH_TIMEOUT_MS` | `15000` | HTTP fetch timeout in ms (1000–60000) |
175
- | `CACHE_ENABLED` | `true` | Enable/disable in-memory content cache |
176
- | `USER_AGENT` | `fetch-url-mcp/{version}` | Custom User-Agent header |
177
- | `ALLOW_REMOTE` | `false` | Allow remote connections in HTTP mode |
178
- | `ALLOWED_HOSTS` | _(empty)_ | Comma-separated host/origin allowlist for HTTP mode |
179
- | `TASKS_MAX_TOTAL` | `5000` | Maximum retained task records across all owners |
180
- | `TASKS_MAX_PER_OWNER` | `1000` | Maximum retained task records per session/client/token |
147
+ | Variable | Default | Description |
148
+ | ------------------ | ------------------------- | --------------------------------------------------- |
149
+ | `HOST` | `127.0.0.1` | HTTP server bind address |
150
+ | `PORT` | `3000` | HTTP server port (1024–65535) |
151
+ | `LOG_LEVEL` | `info` | Log level: `debug`, `info`, `warn`, `error` |
152
+ | `FETCH_TIMEOUT_MS` | `15000` | HTTP fetch timeout in ms (1000–60000) |
153
+ | `CACHE_ENABLED` | `true` | Enable/disable in-memory content cache |
154
+ | `USER_AGENT` | `fetch-url-mcp/{version}` | Custom User-Agent header |
155
+ | `ALLOW_REMOTE` | `false` | Allow remote connections in HTTP mode |
156
+ | `ALLOWED_HOSTS` | _(empty)_ | Comma-separated host/origin allowlist for HTTP mode |
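As a rough illustration of how the ranges in the Core Settings table could be enforced, the sketch below parses a few of the variables with their documented defaults and bounds. The helper names `readBoundedInt` and `loadCoreSettings` are hypothetical, and falling back to the default on an out-of-range value is an assumption; the server may instead reject or clamp such values.

```typescript
// Hypothetical helper: parse an integer setting, falling back to the default
// when the value is missing, not an integer, or outside [min, max].
function readBoundedInt(
  raw: string | undefined,
  fallback: number,
  min: number,
  max: number,
): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < min || value > max) return fallback;
  return value;
}

// Defaults and ranges are taken from the Core Settings table.
function loadCoreSettings(env: Record<string, string | undefined>) {
  return {
    host: env["HOST"] ?? "127.0.0.1",
    port: readBoundedInt(env["PORT"], 3000, 1024, 65535),
    fetchTimeoutMs: readBoundedInt(env["FETCH_TIMEOUT_MS"], 15000, 1000, 60000),
    // Assumption: any value other than the literal "false" keeps the cache on.
    cacheEnabled: (env["CACHE_ENABLED"] ?? "true") !== "false",
  };
}
```

In a Node.js process this would typically be called as `loadCoreSettings(process.env)`.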
157
+
158
+ #### Task Management
159
+
160
+ | Variable | Default | Description |
161
+ | --------------------- | ------- | ------------------------------------------------ |
162
+ | `TASKS_MAX_TOTAL` | `5000` | Maximum retained task records across all owners |
163
+ | `TASKS_MAX_PER_OWNER` | `1000` | Maximum retained task records per session/client |
181
164
 
182
165
  #### Authentication (HTTP Mode)
183
166
 
@@ -282,6 +265,7 @@ Fetches a webpage and converts it to clean Markdown format optimized for LLM con
282
265
  **Limitations:**
283
266
 
284
267
  - Does not execute complex client-side JavaScript interactions
268
+ - Inline output may be truncated when `MAX_INLINE_CONTENT_CHARS` is set
285
269
 
286
270
  ##### Parameters
287
271
 
@@ -297,31 +281,43 @@ Fetches a webpage and converts it to clean Markdown format optimized for LLM con
297
281
  ```json
298
282
  {
299
283
  "url": "https://example.com",
284
+ "inputUrl": "https://example.com",
300
285
  "resolvedUrl": "https://example.com",
301
286
  "finalUrl": "https://example.com",
302
- "inputUrl": "https://example.com",
303
287
  "title": "Example Domain",
288
+ "metadata": {
289
+ "title": "Example Domain",
290
+ "description": "...",
291
+ "author": "...",
292
+ "image": "...",
293
+ "favicon": "...",
294
+ "publishedAt": "...",
295
+ "modifiedAt": "..."
296
+ },
304
297
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
298
+ "fromCache": false,
299
+ "fetchedAt": "2026-02-11T12:00:00.000Z",
300
+ "contentSize": 1234,
305
301
  "truncated": false
306
302
  }
307
303
  ```
308
304
 
309
- | Field | Type | Description |
310
- | ------------- | ---------- | -------------------------------------------------- |
311
- | `url` | `string` | The canonical URL (pre-raw-transform) |
312
- | `inputUrl` | `string` | The original URL provided by the caller |
313
- | `resolvedUrl` | `string` | The normalized/transformed URL that was fetched |
314
- | `finalUrl` | `string?` | Final response URL after redirects |
315
- | `title` | `string?` | Extracted page title |
316
- | `metadata` | `object?` | Extracted metadata (title, description, author...) |
317
- | `markdown` | `string?` | Extracted content in Markdown format |
318
- | `fromCache` | `boolean?` | Whether the response was served from cache |
319
- | `fetchedAt` | `string?` | ISO timestamp for fetch/cache retrieval |
320
- | `contentSize` | `number?` | Full markdown size before inline truncation |
321
- | `truncated` | `boolean?` | Whether inline markdown was truncated |
322
- | `error` | `string?` | Error message if the request failed |
323
- | `statusCode` | `number?` | HTTP status code for failed requests |
324
- | `details` | `object?` | Additional error details |
305
+ | Field | Type | Description |
306
+ | ------------- | ---------- | ---------------------------------------------------------------------------------------- |
307
+ | `url` | `string` | The canonical URL (pre-raw-transform) |
308
+ | `inputUrl` | `string?` | The original URL provided by the caller |
309
+ | `resolvedUrl` | `string?` | The normalized/transformed URL that was fetched |
310
+ | `finalUrl` | `string?` | Final response URL after redirects |
311
+ | `title` | `string?` | Extracted page title |
312
+ | `metadata` | `object?` | Extracted metadata (title, description, author, image, favicon, publishedAt, modifiedAt) |
313
+ | `markdown` | `string?` | Extracted content in Markdown format |
314
+ | `fromCache` | `boolean?` | Whether the response was served from cache |
315
+ | `fetchedAt` | `string?` | ISO timestamp for fetch/cache retrieval |
316
+ | `contentSize` | `number?` | Full markdown size before inline truncation |
317
+ | `truncated` | `boolean?` | Whether inline markdown was truncated |
318
+ | `error` | `string?` | Error message if the request failed |
319
+ | `statusCode` | `number?` | HTTP status code for failed requests |
320
+ | `details` | `object?` | Additional error details |
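For callers that consume the structured output, the field table above can be transcribed into a TypeScript type. This is a sketch derived from the table, not a type exported by the package: the names `FetchUrlMetadata`, `FetchUrlResult`, and `isFailure` are hypothetical, and the string types inside `metadata` are assumed from the JSON example.

```typescript
// Metadata fields as listed in the table; string types are an assumption.
interface FetchUrlMetadata {
  title?: string;
  description?: string;
  author?: string;
  image?: string;
  favicon?: string;
  publishedAt?: string;
  modifiedAt?: string;
}

// Structured output of the fetch-url tool, following the 0.0.2 table
// (only `url` is required; every other field is optional).
interface FetchUrlResult {
  url: string;
  inputUrl?: string;
  resolvedUrl?: string;
  finalUrl?: string;
  title?: string;
  metadata?: FetchUrlMetadata;
  markdown?: string;
  fromCache?: boolean;
  fetchedAt?: string;
  contentSize?: number;
  truncated?: boolean;
  error?: string;
  statusCode?: number;
  details?: Record<string, unknown>;
}

// Hypothetical helper: treats a result as failed when `error` is present.
function isFailure(result: FetchUrlResult): boolean {
  return typeof result.error === "string";
}
```

Because `inputUrl` and `resolvedUrl` became optional in this version, code that reads them should handle `undefined`.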
325
321
 
326
322
  ##### Annotations
327
323
 
@@ -334,7 +330,7 @@ Fetches a webpage and converts it to clean Markdown format optimized for LLM con
334
330
 
335
331
  ##### Async Task Execution
336
332
 
337
- The `fetch-url` tool supports optional async task execution. Include a `task` field in the tool call to run the fetch in the background:
333
+ The `fetch-url` tool supports optional async task execution (`execution.taskSupport: "optional"`). Include a `task` field in the tool call to run the fetch in the background:
338
334
 
339
335
  ```json
340
336
  {
@@ -351,9 +347,9 @@ Then poll `tasks/get` until the task status is `completed` or `failed`, and retr
351
347
 
352
348
  ### Prompts
353
349
 
354
- | Name | Description |
355
- | ---------- | ------------------------------------------------ |
356
- | `get-help` | Returns server usage guidance and workflow hints |
350
+ | Name | Description |
351
+ | ---------- | --------------------------------- |
352
+ | `get-help` | Returns server usage instructions |
357
353
 
358
354
  ### Resources
359
355
 
@@ -362,12 +358,6 @@ Then poll `tasks/get` until the task status is `completed` or `failed`, and retr
362
358
  | `internal://instructions` | `text/markdown` | Server instructions and usage guidance |
363
359
  | `internal://cache/{namespace}/{hash}` | `text/markdown` | Cached markdown entries from prior `fetch-url` calls |
364
360
 
365
- ### Completions
366
-
367
- - `completion/complete` supports `internal://cache/{namespace}/{hash}` template variables:
368
- - `namespace`
369
- - `hash` (optionally filtered by `context.arguments.namespace`)
370
-
371
361
  ### Tasks
372
362
 
373
363
  The server declares full MCP task support:
@@ -479,6 +469,37 @@ Add to your Windsurf MCP configuration:
479
469
 
480
470
  </details>
481
471
 
472
+ <details>
473
+ <summary>Docker</summary>
474
+
475
+ Use the published image from GitHub Container Registry:
476
+
477
+ ```json
478
+ {
479
+ "mcpServers": {
480
+ "fetch-url-mcp": {
481
+ "command": "docker",
482
+ "args": [
483
+ "run",
484
+ "-i",
485
+ "--rm",
486
+ "ghcr.io/j0hanz/fetch-url-mcp:latest",
487
+ "--stdio"
488
+ ]
489
+ }
490
+ }
491
+ }
492
+ ```
493
+
494
+ Or build and run locally:
495
+
496
+ ```bash
497
+ docker build -t fetch-url-mcp .
498
+ docker run -i --rm fetch-url-mcp --stdio
499
+ ```
500
+
501
+ </details>
502
+
482
503
  ## Security
483
504
 
484
505
  ### SSRF Protection
@@ -486,7 +507,7 @@ Add to your Windsurf MCP configuration:
486
507
  Fetch URL blocks requests to private and internal network addresses:
487
508
 
488
509
  - **Blocked hosts**: `localhost`, `127.0.0.0/8`, `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `169.254.0.0/16`, `100.64.0.0/10`
489
- - **Blocked IPv6**: `::1`, `fc00::/7`, `fe80::/10`, IPv4-mapped private addresses (`::ffff:10.*`, etc.)
510
+ - **Blocked IPv6**: `::1`, `fc00::/7`, `fe80::/10`, IPv4-mapped private addresses
490
511
  - **Cloud metadata**: `169.254.169.254` (AWS), `metadata.google.internal`, `metadata.azure.com`, `100.100.100.200` (Alibaba Cloud)
491
512
 
492
513
  DNS preflight checks run on every redirect hop to prevent DNS rebinding attacks.
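The IPv4 part of the block list above can be sketched as a CIDR membership test applied to each resolved address. This is a minimal illustration, assuming the check runs on the addresses returned by DNS resolution for every hop; the names `BLOCKED_IPV4_RANGES`, `ipv4ToInt`, and `isBlockedIpv4` are hypothetical, and IPv6 and hostname rules are left out.

```typescript
// IPv4 ranges copied from the "Blocked hosts" list, as [base, prefixLength].
const BLOCKED_IPV4_RANGES: ReadonlyArray<readonly [string, number]> = [
  ["127.0.0.0", 8],
  ["10.0.0.0", 8],
  ["172.16.0.0", 12],
  ["192.168.0.0", 16],
  ["169.254.0.0", 16],
  ["100.64.0.0", 10],
];

// Converts dotted-quad notation to an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(address: string): number | null {
  const parts = address.split(".");
  if (parts.length !== 4) return null;
  let result = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    result = result * 256 + octet;
  }
  return result;
}

// Returns true when the address falls inside any blocked range.
// Assumption: input that is not valid IPv4 is treated as blocked (fail closed).
function isBlockedIpv4(address: string): boolean {
  const ip = ipv4ToInt(address);
  if (ip === null) return true;
  return BLOCKED_IPV4_RANGES.some(([base, prefix]) => {
    const baseInt = ipv4ToInt(base);
    if (baseInt === null) return false;
    const blockSize = 2 ** (32 - prefix); // number of addresses in the range
    // Two addresses are in the same range when they share the same block index.
    return Math.floor(ip / blockSize) === Math.floor(baseInt / blockSize);
  });
}
```

Running the test on every redirect hop, rather than only on the first URL, is what keeps a public host from redirecting the fetcher to an internal address.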
@@ -532,12 +553,12 @@ npm install
532
553
  ## Build and Release
533
554
 
534
555
  ```bash
535
- npm run build # Clean → Compile → Copy Assets → chmod
556
+ npm run build # Clean → Compile → Copy Assets → chmod
536
557
  npm run prepublishOnly # Lint → Type-Check → Build
537
- npm publish # Publish to npm
558
+ npm publish # Publish to npm
538
559
  ```
539
560
 
540
- The `prepare` script runs `npm run build` automatically on `npm install` from source.
561
+ CI/CD is handled by a GitHub Actions workflow (`release.yml`) that runs the lint, type-check, test, and build steps, then publishes to npm with a version bump.
541
562
 
542
563
  ## Troubleshooting
543
564
 
@@ -549,17 +570,15 @@ Use the built-in inspector to test the server interactively:
549
570
  npm run inspector
550
571
  ```
551
572
 
552
- This builds the project and launches `@modelcontextprotocol/inspector` pointing to the compiled server.
553
-
554
573
  ### Common Issues
555
574
 
556
- | Issue | Solution |
557
- | ------------------------- | ----------------------------------------------------------------------------------------------------------- |
558
- | `VALIDATION_ERROR` on URL | URL is blocked (private IP/localhost) or malformed. Do not retry. |
559
- | `queue_full` error | Usually auto-handled by worker fallback to in-process transform; if surfaced, retry or use async task mode. |
560
- | Garbled output | Binary content (images, PDFs) cannot be converted. Ensure the URL serves HTML. |
561
- | No output in stdio mode | Ensure `--stdio` flag is passed. Without it, the server starts in HTTP mode. |
562
- | Auth errors in HTTP mode | Set `ACCESS_TOKENS` or `API_KEY` env var and pass as `Authorization: Bearer <token>`. |
575
+ | Issue | Solution |
576
+ | ------------------------- | ------------------------------------------------------------------------------------- |
577
+ | `VALIDATION_ERROR` on URL | URL is blocked (private IP/localhost) or malformed. Do not retry. |
578
+ | `queue_full` error | Worker pool busy. Wait briefly, then retry or use async task mode. |
579
+ | Garbled output | Binary content (images, PDFs) cannot be converted. Ensure the URL serves HTML. |
580
+ | No output in stdio mode | Ensure `--stdio` flag is passed. Without it, the server starts in HTTP mode. |
581
+ | Auth errors in HTTP mode | Set `ACCESS_TOKENS` or `API_KEY` env var and pass as `Authorization: Bearer <token>`. |
563
582
 
564
583
  ### Stdout / Stderr Guidance
565
584