xcrawl-mcp 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (51)
  1. package/.editorconfig +12 -0
  2. package/.env.example +3 -0
  3. package/.prettierrc +6 -0
  4. package/README.md +244 -0
  5. package/claude.md +295 -0
  6. package/dist/core/crawl.d.ts +246 -0
  7. package/dist/core/crawl.d.ts.map +1 -0
  8. package/dist/core/crawl.js +141 -0
  9. package/dist/core/crawl.js.map +1 -0
  10. package/dist/core/map.d.ts +34 -0
  11. package/dist/core/map.d.ts.map +1 -0
  12. package/dist/core/map.js +50 -0
  13. package/dist/core/map.js.map +1 -0
  14. package/dist/core/scrape.d.ts +201 -0
  15. package/dist/core/scrape.d.ts.map +1 -0
  16. package/dist/core/scrape.js +148 -0
  17. package/dist/core/scrape.js.map +1 -0
  18. package/dist/core/search.d.ts +144 -0
  19. package/dist/core/search.d.ts.map +1 -0
  20. package/dist/core/search.js +75 -0
  21. package/dist/core/search.js.map +1 -0
  22. package/dist/index.d.ts +8 -0
  23. package/dist/index.d.ts.map +1 -0
  24. package/dist/index.js +516 -0
  25. package/dist/index.js.map +1 -0
  26. package/dist/stdio.d.ts +3 -0
  27. package/dist/stdio.d.ts.map +1 -0
  28. package/dist/stdio.js +551 -0
  29. package/dist/stdio.js.map +1 -0
  30. package/dist/tools.d.ts +540 -0
  31. package/dist/tools.d.ts.map +1 -0
  32. package/dist/tools.js +528 -0
  33. package/dist/tools.js.map +1 -0
  34. package/dist/types.d.ts +214 -0
  35. package/dist/types.d.ts.map +1 -0
  36. package/dist/types.js +5 -0
  37. package/dist/types.js.map +1 -0
  38. package/package.json +33 -0
  39. package/src/core/crawl.ts +149 -0
  40. package/src/core/map.ts +56 -0
  41. package/src/core/scrape.ts +156 -0
  42. package/src/core/search.ts +81 -0
  43. package/src/index.ts +565 -0
  44. package/src/stdio.ts +584 -0
  45. package/src/tools.ts +539 -0
  46. package/src/types.ts +221 -0
  47. package/tsconfig.build.json +14 -0
  48. package/tsconfig.json +45 -0
  49. package/vitest.config.mts +11 -0
  50. package/worker-configuration.d.ts +10848 -0
  51. package/wrangler.jsonc +26 -0
package/.editorconfig ADDED
@@ -0,0 +1,12 @@
# http://editorconfig.org
root = true

[*]
indent_style = tab
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true

[*.yml]
indent_style = space
package/.env.example ADDED
@@ -0,0 +1,3 @@
# xCrawl API Key
# Get your API key from https://run.xcrawl.com
XCRAWL_API_KEY=your-xcrawl-api-key-here
package/.prettierrc ADDED
@@ -0,0 +1,6 @@
{
	"printWidth": 140,
	"singleQuote": true,
	"semi": true,
	"useTabs": true
}
package/README.md ADDED
@@ -0,0 +1,244 @@
# XCrawl MCP Server

Model Context Protocol (MCP) server for XCrawl. It exposes scraping, search, map, and crawl tools to MCP clients.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Quick Start (Stdio)](#quick-start-stdio)
- [Claude Desktop Configuration](#claude-desktop-configuration)
- [Cloudflare Workers Deployment](#cloudflare-workers-deployment)
- [Authentication](#authentication)
- [Available Tools](#available-tools)
- [Request Defaults](#request-defaults)
- [Error Format](#error-format)
- [Development](#development)
- [License](#license)
- [Support](#support)

## Prerequisites

- Node.js 18+
- XCrawl API key from [xcrawl.com](https://xcrawl.com)
- Cloudflare account (only for Workers deployment)

## Quick Start (Stdio)

Run directly with `npx`:

```bash
XCRAWL_API_KEY=your-api-key npx -y xcrawl-mcp
```

Or install globally:

```bash
npm install -g xcrawl-mcp
XCRAWL_API_KEY=your-api-key xcrawl-mcp
```

## Claude Desktop Configuration

Add to the Claude Desktop config:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`

```json
{
  "mcpServers": {
    "xcrawl": {
      "command": "npx",
      "args": ["-y", "xcrawl-mcp"],
      "env": {
        "XCRAWL_API_KEY": "your-api-key"
      }
    }
  }
}
```

## Cloudflare Workers Deployment

```bash
git clone <your-repo>
cd xcrawl-mcp
npm install
npm run deploy
```

After deployment:

- `https://xcrawl-mcp.<your-account>.workers.dev/mcp`
- `https://xcrawl-mcp.<your-account>.workers.dev/health`

## Authentication

### Stdio mode

Use the environment variable:

- `XCRAWL_API_KEY` (required)

### Workers mode

Pass the API key in request headers (priority order):

1. `Authorization: Bearer <api-key>`
2. `x-api-key: <api-key>`
3. `x-xcrawl-api-key: <api-key>`
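
In Workers mode, requests can also be made directly over HTTP. A minimal TypeScript client sketch (the worker URL is a placeholder, and the body is a standard MCP JSON-RPC `tools/list` request):

```typescript
// Placeholder URL: substitute your own deployment (or mcp.xcrawl.com).
const endpoint = 'https://xcrawl-mcp.<your-account>.workers.dev/mcp';

const response = await fetch(endpoint, {
	method: 'POST',
	headers: {
		'Content-Type': 'application/json',
		// Highest-priority header; x-api-key or x-xcrawl-api-key also work.
		Authorization: `Bearer ${process.env.XCRAWL_API_KEY}`,
	},
	// Standard MCP JSON-RPC request listing the server's tools.
	body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'tools/list' }),
});

console.log(await response.json());
```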

## Available Tools

| Tool | Purpose |
|------|---------|
| `xcrawl_scrape` | Scrape a single page |
| `xcrawl_check_status` | Check async scrape status |
| `xcrawl_search` | Google SERP search |
| `xcrawl_map` | Discover URLs from a site |
| `xcrawl_crawl` | Async multi-page crawl |
| `xcrawl_check_crawl_status` | Check async crawl status |

### `xcrawl_scrape`

Basic request:

```json
{
  "url": "https://example.com"
}
```

With extraction (`json.prompt` and `json.json_schema` are optional):

```json
{
  "url": "https://example.com",
  "output": {
    "formats": ["json"],
    "json": {
      "prompt": "Extract product name and price",
      "json_schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "price": { "type": "number" }
        }
      }
    }
  }
}
```

### `xcrawl_check_status`

```json
{
  "scrape_id": "abc-123-def-456"
}
```

Possible status values: `pending`, `crawling`, `completed`, `failed`.

### `xcrawl_search`

```json
{
  "query": "latest AI news",
  "location": "New York",
  "language": "en",
  "limit": 10
}
```

### `xcrawl_map`

```json
{
  "url": "https://example.com",
  "filter": "blog/.*",
  "limit": 1000,
  "include_subdomains": true,
  "ignore_query_parameters": true
}
```

### `xcrawl_crawl`

```json
{
  "url": "https://example.com",
  "crawler": {
    "limit": 100,
    "include": ["products/.*"],
    "exclude": ["admin/.*", "login/.*"],
    "max_depth": 3
  },
  "output": {
    "formats": ["markdown"]
  }
}
```

### `xcrawl_check_crawl_status`

```json
{
  "crawl_id": "xyz-789-abc-012"
}
```

Possible status values: `pending`, `crawling`, `completed`, `failed`.

## Request Defaults

Common defaults (may vary by API version):

| Parameter | Default | Notes |
|-----------|---------|-------|
| `mode` | `"sync"` | Scrape only |
| `proxy.location` | `"US"` | ISO 3166-1 alpha-2 code |
| `request.device` | `"desktop"` | `desktop` or `mobile` |
| `request.only_main_content` | `true` | Main-content filtering |
| `request.block_ads` | `true` | Ad blocking |
| `request.skip_tls_verification` | `true` | Skip TLS verification |
| `js_render.enabled` | `true` | JavaScript rendering |
| `js_render.wait_until` | `"load"` | `load`, `domcontentloaded`, `networkidle` |
| `output.formats` | `[]` | If omitted or `[]`, returns `metadata` only |
| `output.screenshot` | `"viewport"` | `viewport` or `full_page` |
| `output.json.prompt` | - | Optional |
| `output.json.json_schema` | - | Optional |
| `webhook.events` | `["started","completed","failed"]` | Optional callback events |
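
For illustration, `xcrawl_scrape` arguments that override a few of these defaults (the field shapes follow the tables above; treat this as a sketch, not the package's canonical types):

```typescript
// Hypothetical xcrawl_scrape arguments overriding documented defaults.
const scrapeArgs = {
	url: 'https://example.com',
	request: { device: 'mobile' }, // default: "desktop"
	js_render: { enabled: true, wait_until: 'networkidle' }, // default: "load"
	output: { formats: ['markdown', 'links'] }, // default: [] (metadata only)
};
```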

## Error Format

Errors are returned in MCP format:

```json
{
  "content": [
    {
      "type": "text",
      "text": "Error: XCrawl API error: 401 Unauthorized - Invalid API key"
    }
  ],
  "isError": true
}
```
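
On the client side, a small guard can surface these errors (the result shape is assumed from the example above):

```typescript
// Minimal MCP tool-result shape, assumed from the error example above.
interface ToolResult {
	isError?: boolean;
	content: Array<{ type: string; text?: string }>;
}

// Throws on tool errors; otherwise returns the first text payload.
function unwrapToolResult(result: ToolResult): string {
	const text = result.content.find((c) => c.type === 'text')?.text ?? '';
	if (result.isError) {
		throw new Error(text || 'XCrawl tool call failed');
	}
	return text;
}
```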

## Development

```bash
# Build and run the stdio server locally
npm install
npm run build
XCRAWL_API_KEY=your-key npm run start:stdio:dev

# Run the Workers dev server and hit its health endpoint
npm run dev
curl http://localhost:8787/health
```

## License

MIT

## Support

- XCrawl API: [xcrawl.com](https://xcrawl.com)
- MCP Server: Open an issue in this repository
package/claude.md ADDED
@@ -0,0 +1,295 @@
# xCrawl MCP Server - Claude Documentation

## Project Overview

**xCrawl MCP Server** is a Model Context Protocol (MCP) server that integrates the xCrawl web scraping API with AI models like Claude. It enables Claude to access web content, take screenshots, and extract structured data from any URL.

### Key Information

- **Name**: xcrawl-mcp
- **Version**: 1.0.0
- **Type**: TypeScript/Node.js project with dual deployment modes
- **Protocol**: MCP (Model Context Protocol) 2024-11-05
- **API Integration**: xCrawl API (https://run.xcrawl.com)

## Architecture

The project supports two deployment modes:

### 1. stdio Mode (Local)

- Runs as a local Node.js process
- Communicates via standard input/output (stdio transport)
- Used with Claude Desktop and similar MCP clients
- Entry point: `src/stdio.ts`
- Requires the `XCRAWL_API_KEY` environment variable

### 2. HTTP Mode (Cloudflare Workers)

- Deployed to the Cloudflare Workers edge network
- HTTP/JSON-RPC endpoint for remote access
- Entry point: `src/index.ts`
- API key provided via request headers (not environment variables)
- Custom domain: mcp.xcrawl.com

## Project Structure

```
xcrawl-mcp/
├── src/
│   ├── index.ts            # Cloudflare Workers entry (HTTP mode)
│   ├── stdio.ts            # stdio mode entry point
│   ├── types.ts            # TypeScript type definitions
│   ├── tools.ts            # Shared MCP tool definitions
│   └── core/
│       ├── crawl.ts        # Crawl tool logic
│       ├── map.ts          # Map tool logic
│       ├── scrape.ts       # Core scraping logic and API calls
│       └── search.ts       # Search tool logic
├── package.json            # Dependencies and scripts
├── tsconfig.json           # TypeScript configuration
├── tsconfig.build.json     # Build-specific TypeScript config
├── wrangler.jsonc          # Cloudflare Workers configuration
├── README.md               # User documentation
└── claude.md               # This file - Claude-specific documentation
```

## Core Components

### 1. MCP Tools (src/tools.ts)

The server provides the six MCP tools listed in README.md; the two scrape-related tools are detailed below:

#### `xcrawl_scrape`

Scrapes web pages with extensive configuration options.

**Key Features:**

- Multiple output formats: markdown, html, raw_html, links, summary, screenshot, json
- AI-powered structured data extraction
- Full browser JavaScript rendering
- Residential proxy support with location selection
- Sync/async execution modes
- Content filtering (ads, navigation removal)
- Full-page or viewport screenshots
- Custom headers and cookies

**Common Usage:**

```json
{
  "url": "https://example.com",
  "output": { "formats": ["markdown"] }
}
```

#### `xcrawl_check_status`

Checks the status and retrieves the results of async scrape tasks.

**Usage:**

```json
{
  "scrape_id": "job_abc123"
}
```

### 2. Core Scraping Logic (src/core/scrape.ts)

**Functions:**

- `callXCrawlAPI()` - Makes POST requests to the xCrawl API
- `checkScrapeStatus()` - Polls job status for async tasks
- `formatScrapeResponse()` - Formats API responses for MCP

**Key Details:**

- API endpoint: https://run.xcrawl.com/v1/scrape
- Status endpoint: https://run.xcrawl.com/v1/jobs/{scrape_id}/result
- Timeout: 300 seconds (5 minutes)
- Uses Zod for request validation
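
A rough sketch of what that validation might look like (field names mirror the documented request shape; the actual schema in `src/core/scrape.ts` may differ):

```typescript
import { z } from 'zod';

// Illustrative schema only; mirrors the documented request shape.
const ScrapeRequestSchema = z.object({
	url: z.string().url(),
	mode: z.enum(['sync', 'async']).default('sync'),
	output: z
		.object({
			formats: z
				.array(z.enum(['markdown', 'html', 'raw_html', 'links', 'summary', 'screenshot', 'json']))
				.default([]),
		})
		.optional(),
});

// safeParse returns a result object instead of throwing on bad input.
const parsed = ScrapeRequestSchema.safeParse({ url: 'https://example.com' });
if (!parsed.success) {
	console.error(parsed.error.flatten());
}
```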

### 3. Type Definitions (src/types.ts)

**Main Types:**

- `XCrawlScrapeRequest` - Request parameters for scraping
- `XCrawlScrapeResponse` - API response structure

### 4. Entry Points

#### stdio Mode (src/stdio.ts)

- Uses the `@modelcontextprotocol/sdk` `StdioServerTransport`
- Reads the API key from the `XCRAWL_API_KEY` environment variable
- Validates requests with Zod schemas
- Runs as a CLI tool
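With SDK 1.x, the wiring typically looks something like this (a minimal sketch, not the package's actual `src/stdio.ts`):

```typescript
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';

// Fail fast if the required API key is missing.
const apiKey = process.env.XCRAWL_API_KEY;
if (!apiKey) {
	console.error('XCRAWL_API_KEY environment variable is required');
	process.exit(1);
}

const server = new Server(
	{ name: 'xcrawl-mcp', version: '1.0.0' },
	{ capabilities: { tools: {} } },
);

// tools/list and tools/call request handlers would be registered here.

const transport = new StdioServerTransport();
await server.connect(transport);
```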

#### HTTP Mode (src/index.ts)

- Cloudflare Workers fetch handler
- Endpoints:
  - `/mcp` - Main MCP JSON-RPC endpoint
  - `/health` - Health check
  - `/sse` - SSE endpoint (not implemented)
- Extracts the API key from headers, in priority order:
  - `Authorization: Bearer <key>` (preferred)
  - `x-api-key: <key>`
  - `x-xcrawl-api-key: <key>`
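
That priority order could be implemented along these lines (an illustrative helper, not the actual source):

```typescript
// Illustrative sketch of the documented header priority.
function extractApiKey(request: Request): string | null {
	const auth = request.headers.get('Authorization');
	if (auth?.startsWith('Bearer ')) {
		return auth.slice('Bearer '.length);
	}
	return request.headers.get('x-api-key') ?? request.headers.get('x-xcrawl-api-key');
}
```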

## Authentication

### stdio Mode

- API key: `XCRAWL_API_KEY` environment variable
- Configured in MCP client settings

### Cloudflare Workers

- API key: request headers only
- No environment variables or secrets required
- Each request provides its own API key

## Deployment

### Local Development (stdio)

```bash
npm install
npm run build
XCRAWL_API_KEY=your-key npm run start:stdio:dev
```

### Cloudflare Workers

```bash
npm install
npm run deploy
```

**Deployment Configuration (wrangler.jsonc):**

- Account ID: e579d774006254eebf5dd7dcb44bacd1
- Custom domain: mcp.xcrawl.com
- Compatibility date: 2025-12-24
- Minification: enabled
- Observability: traces and logs enabled

## Key Features

### Output Formats

- `html` - Cleaned HTML (no scripts, styles)
- `raw_html` - Original HTML with all tags
- `markdown` - Markdown conversion
- `links` - Extracted URLs
- `summary` - AI-generated summary
- `screenshot` - Image URL (full_page or viewport)
- `json` - AI-extracted structured data

### AI-Powered Features

- **Structured Data Extraction**: Use natural language prompts to extract data
- **Smart Summarization**: AI-generated page summaries
- **Schema Validation**: Optional JSON Schema for strict output

### Performance Features

- 300-second timeout for complex pages
- Async mode for long-running tasks
- Proxy support for geo-restricted content
- JavaScript rendering with configurable wait conditions

## Response Structure

### Successful Scrape Response

```json
{
  "scrape_id": "uuid",
  "endpoint": "scrape",
  "version": "string",
  "status": "success",
  "url": "https://example.com",
  "data": {
    "markdown": "# Content...",
    "html": "cleaned HTML...",
    "raw_html": "<!DOCTYPE html>...",
    "links": ["https://..."],
    "screenshot": "https://storage.url/screenshot.png",
    "summary": "AI summary...",
    "json": { "extracted": "data" },
    "metadata": {
      "title": "Page Title",
      "status_code": 200,
      "proxy_location": "US",
      "proxy_sticky_session": "session-id"
    }
  },
  "startedAt": "2025-12-24T00:00:00Z",
  "endedAt": "2025-12-24T00:00:05Z"
}
```

### Async Task Response

```json
{
  "scrape_id": "uuid",
  "status": "pending",
  "message": "Task queued for processing"
}
```
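
The authoritative definitions live in `src/types.ts`; a rough TypeScript shape inferred from the two examples above:

```typescript
// Shape inferred from the documented examples; src/types.ts is authoritative.
interface ScrapeResponseSketch {
	scrape_id: string;
	status: 'success' | 'pending' | 'crawling' | 'completed' | 'failed';
	url?: string;
	message?: string;
	data?: {
		markdown?: string;
		html?: string;
		raw_html?: string;
		links?: string[];
		screenshot?: string;
		summary?: string;
		json?: unknown;
		metadata?: { title?: string; status_code?: number; proxy_location?: string };
	};
	startedAt?: string;
	endedAt?: string;
}
```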

## Dependencies

### Production

- `@modelcontextprotocol/sdk` ^1.0.4 - MCP protocol implementation
- `zod` ^3.23.8 - Schema validation

### Development

- `@cloudflare/vitest-pool-workers` - Workers testing
- `@cloudflare/workers-types` - TypeScript types for Workers
- `typescript` - TypeScript compiler
- `tsx` - TypeScript execution for development
- `wrangler` - Cloudflare Workers CLI
- `vitest` - Testing framework

## Scripts

```jsonc
{
  "deploy": "wrangler deploy",           // Deploy to Cloudflare Workers
  "dev": "wrangler dev",                 // Local Workers dev server
  "build": "tsc -p tsconfig.build.json", // Build TypeScript
  "start:stdio": "node dist/stdio.js",   // Run built stdio mode
  "start:stdio:dev": "tsx src/stdio.ts", // Run stdio mode in dev
  "test": "vitest",                      // Run tests
  "cf-typegen": "wrangler types"         // Generate Workers types
}
```

## Configuration Files

### package.json

- Main entry: `dist/stdio.js`
- Binary: `xcrawl-mcp`
- Type: ESM module

### tsconfig.json

- Target: ES2022
- Module: ESNext
- Module resolution: bundler
- Strict mode enabled

### wrangler.jsonc

- Deployment target: Cloudflare Workers
- Custom domain routing
- Observability enabled
- Minification enabled

## Important Notes

### For Claude/AI Models Using This Server

1. **Default Mode**: Always use sync mode unless the user explicitly requests async
2. **Screenshot Display**: Always show the full URL when a screenshot exists in the response
3. **JSON Data**: Display the complete raw JSON data; never summarize unless asked
4. **Async Workflow** (sketched below):
   - Call `xcrawl_scrape` with `"mode": "async"`
   - Extract `scrape_id` from the top-level response (not from `response.data`)
   - Wait 10-15 seconds
   - Call `xcrawl_check_status` with the `scrape_id`
5. **Error Handling**: All errors are returned in MCP tool response format with `isError: true`
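
A client-side sketch of that workflow (`callTool` stands in for however your MCP client invokes tools):

```typescript
// Illustrative async-scrape polling loop following the documented workflow.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<any>;

async function scrapeAsync(callTool: CallTool, url: string) {
	const started = await callTool('xcrawl_scrape', { url, mode: 'async' });
	const scrapeId = started.scrape_id; // top-level field, not started.data

	for (;;) {
		// Wait 10-15 seconds between status checks, per the notes above.
		await new Promise((resolve) => setTimeout(resolve, 12_000));
		const status = await callTool('xcrawl_check_status', { scrape_id: scrapeId });
		if (status.status === 'completed' || status.status === 'failed') {
			return status;
		}
	}
}
```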

### Security Considerations

- API keys handled securely (env vars for stdio, headers for HTTP)
- TLS verification can be skipped for internal sites
- No API keys stored in the Workers environment
- Each request authenticates independently

## Git Status

- Current branch: master
- Modified files: wrangler.jsonc
- Recent changes focus on tool descriptions and timeout adjustments

## Related Resources

- xCrawl API: https://run.xcrawl.com
- MCP Protocol: https://modelcontextprotocol.io
- Cloudflare Workers: https://workers.cloudflare.com