xcrawl-mcp 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (51)
  1. package/.editorconfig +12 -0
  2. package/.env.example +3 -0
  3. package/.prettierrc +6 -0
  4. package/README.md +244 -0
  5. package/claude.md +295 -0
  6. package/dist/core/crawl.d.ts +246 -0
  7. package/dist/core/crawl.d.ts.map +1 -0
  8. package/dist/core/crawl.js +141 -0
  9. package/dist/core/crawl.js.map +1 -0
  10. package/dist/core/map.d.ts +34 -0
  11. package/dist/core/map.d.ts.map +1 -0
  12. package/dist/core/map.js +50 -0
  13. package/dist/core/map.js.map +1 -0
  14. package/dist/core/scrape.d.ts +201 -0
  15. package/dist/core/scrape.d.ts.map +1 -0
  16. package/dist/core/scrape.js +148 -0
  17. package/dist/core/scrape.js.map +1 -0
  18. package/dist/core/search.d.ts +144 -0
  19. package/dist/core/search.d.ts.map +1 -0
  20. package/dist/core/search.js +75 -0
  21. package/dist/core/search.js.map +1 -0
  22. package/dist/index.d.ts +8 -0
  23. package/dist/index.d.ts.map +1 -0
  24. package/dist/index.js +516 -0
  25. package/dist/index.js.map +1 -0
  26. package/dist/stdio.d.ts +3 -0
  27. package/dist/stdio.d.ts.map +1 -0
  28. package/dist/stdio.js +551 -0
  29. package/dist/stdio.js.map +1 -0
  30. package/dist/tools.d.ts +540 -0
  31. package/dist/tools.d.ts.map +1 -0
  32. package/dist/tools.js +528 -0
  33. package/dist/tools.js.map +1 -0
  34. package/dist/types.d.ts +214 -0
  35. package/dist/types.d.ts.map +1 -0
  36. package/dist/types.js +5 -0
  37. package/dist/types.js.map +1 -0
  38. package/package.json +33 -0
  39. package/src/core/crawl.ts +149 -0
  40. package/src/core/map.ts +56 -0
  41. package/src/core/scrape.ts +156 -0
  42. package/src/core/search.ts +81 -0
  43. package/src/index.ts +565 -0
  44. package/src/stdio.ts +584 -0
  45. package/src/tools.ts +539 -0
  46. package/src/types.ts +221 -0
  47. package/tsconfig.build.json +14 -0
  48. package/tsconfig.json +45 -0
  49. package/vitest.config.mts +11 -0
  50. package/worker-configuration.d.ts +10848 -0
  51. package/wrangler.jsonc +26 -0
package/.editorconfig ADDED
@@ -0,0 +1,12 @@
# http://editorconfig.org
root = true

[*]
indent_style = tab
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true

[*.yml]
indent_style = space
package/.env.example ADDED
@@ -0,0 +1,3 @@
# xCrawl API Key
# Get your API key from https://run.xcrawl.com
XCRAWL_API_KEY=your-xcrawl-api-key-here
package/.prettierrc ADDED
@@ -0,0 +1,6 @@
{
	"printWidth": 140,
	"singleQuote": true,
	"semi": true,
	"useTabs": true
}
package/README.md ADDED
@@ -0,0 +1,244 @@
# XCrawl MCP Server

Model Context Protocol (MCP) server for XCrawl. It exposes scraping, search, map, and crawl tools to MCP clients.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Quick Start (Stdio)](#quick-start-stdio)
- [Claude Desktop Configuration](#claude-desktop-configuration)
- [Cloudflare Workers Deployment](#cloudflare-workers-deployment)
- [Authentication](#authentication)
- [Available Tools](#available-tools)
- [Request Defaults](#request-defaults)
- [Error Format](#error-format)
- [Development](#development)
- [License](#license)
- [Support](#support)

## Prerequisites

- Node.js 18+
- XCrawl API key from [xcrawl.com](https://xcrawl.com)
- Cloudflare account (only for Workers deployment)

## Quick Start (Stdio)

Run directly with `npx`:

```bash
XCRAWL_API_KEY=your-api-key npx -y xcrawl-mcp
```

Or install globally:

```bash
npm install -g xcrawl-mcp
XCRAWL_API_KEY=your-api-key xcrawl-mcp
```

## Claude Desktop Configuration

Add to the Claude Desktop config:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`

```json
{
  "mcpServers": {
    "xcrawl": {
      "command": "npx",
      "args": ["-y", "xcrawl-mcp"],
      "env": {
        "XCRAWL_API_KEY": "your-api-key"
      }
    }
  }
}
```

## Cloudflare Workers Deployment

```bash
git clone <your-repo>
cd xcrawl-mcp
npm install
npm run deploy
```

After deployment:

- `https://xcrawl-mcp.<your-account>.workers.dev/mcp`
- `https://xcrawl-mcp.<your-account>.workers.dev/health`

## Authentication

### Stdio mode

Use the environment variable:

- `XCRAWL_API_KEY` (required)

### Workers mode

Pass the API key in request headers (priority order):

1. `Authorization: Bearer <api-key>`
2. `x-api-key: <api-key>`
3. `x-xcrawl-api-key: <api-key>`
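
In Workers mode, requests can also be made directly over HTTP. A minimal TypeScript client sketch (the worker URL is a placeholder, and the body is a standard MCP JSON-RPC `tools/list` request):

```typescript
// Placeholder URL: substitute your own deployment (or mcp.xcrawl.com).
const endpoint = 'https://xcrawl-mcp.<your-account>.workers.dev/mcp';

const response = await fetch(endpoint, {
	method: 'POST',
	headers: {
		'Content-Type': 'application/json',
		// Highest-priority header; x-api-key or x-xcrawl-api-key also work.
		Authorization: `Bearer ${process.env.XCRAWL_API_KEY}`,
	},
	// Standard MCP JSON-RPC request listing the server's tools.
	body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'tools/list' }),
});

console.log(await response.json());
```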

## Available Tools

| Tool | Purpose |
|------|---------|
| `xcrawl_scrape` | Scrape a single page |
| `xcrawl_check_status` | Check async scrape status |
| `xcrawl_search` | Google SERP search |
| `xcrawl_map` | Discover URLs from a site |
| `xcrawl_crawl` | Async multi-page crawl |
| `xcrawl_check_crawl_status` | Check async crawl status |

### `xcrawl_scrape`

Basic request:

```json
{
  "url": "https://example.com"
}
```

With extraction (`json.prompt` and `json.json_schema` are optional):

```json
{
  "url": "https://example.com",
  "output": {
    "formats": ["json"],
    "json": {
      "prompt": "Extract product name and price",
      "json_schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "price": { "type": "number" }
        }
      }
    }
  }
}
```

### `xcrawl_check_status`

```json
{
  "scrape_id": "abc-123-def-456"
}
```

Possible status values: `pending`, `crawling`, `completed`, `failed`.

### `xcrawl_search`

```json
{
  "query": "latest AI news",
  "location": "New York",
  "language": "en",
  "limit": 10
}
```

### `xcrawl_map`

```json
{
  "url": "https://example.com",
  "filter": "blog/.*",
  "limit": 1000,
  "include_subdomains": true,
  "ignore_query_parameters": true
}
```

### `xcrawl_crawl`

```json
{
  "url": "https://example.com",
  "crawler": {
    "limit": 100,
    "include": ["products/.*"],
    "exclude": ["admin/.*", "login/.*"],
    "max_depth": 3
  },
  "output": {
    "formats": ["markdown"]
  }
}
```

### `xcrawl_check_crawl_status`

```json
{
  "crawl_id": "xyz-789-abc-012"
}
```

Possible status values: `pending`, `crawling`, `completed`, `failed`.

## Request Defaults

Common defaults (may vary by API version):

| Parameter | Default | Notes |
|-----------|---------|-------|
| `mode` | `"sync"` | Scrape only |
| `proxy.location` | `"US"` | ISO 3166-1 alpha-2 code |
| `request.device` | `"desktop"` | `desktop` or `mobile` |
| `request.only_main_content` | `true` | Main-content filtering |
| `request.block_ads` | `true` | Ad blocking |
| `request.skip_tls_verification` | `true` | Skip TLS verification |
| `js_render.enabled` | `true` | JavaScript rendering |
| `js_render.wait_until` | `"load"` | `load`, `domcontentloaded`, `networkidle` |
| `output.formats` | `[]` | If omitted or `[]`, returns `metadata` only |
| `output.screenshot` | `"viewport"` | `viewport` or `full_page` |
| `output.json.prompt` | - | Optional |
| `output.json.json_schema` | - | Optional |
| `webhook.events` | `["started","completed","failed"]` | Optional callback events |
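
For illustration, `xcrawl_scrape` arguments that override a few of these defaults (the field shapes follow the tables above; treat this as a sketch, not the package's canonical types):

```typescript
// Hypothetical xcrawl_scrape arguments overriding documented defaults.
const scrapeArgs = {
	url: 'https://example.com',
	request: { device: 'mobile' }, // default: "desktop"
	js_render: { enabled: true, wait_until: 'networkidle' }, // default: "load"
	output: { formats: ['markdown', 'links'] }, // default: [] (metadata only)
};
```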

## Error Format

Errors are returned in MCP format:

```json
{
  "content": [
    {
      "type": "text",
      "text": "Error: XCrawl API error: 401 Unauthorized - Invalid API key"
    }
  ],
  "isError": true
}
```
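
On the client side, a small guard can surface these errors (the result shape is assumed from the example above):

```typescript
// Minimal MCP tool-result shape, assumed from the error example above.
interface ToolResult {
	isError?: boolean;
	content: Array<{ type: string; text?: string }>;
}

// Throws on tool errors; otherwise returns the first text payload.
function unwrapToolResult(result: ToolResult): string {
	const text = result.content.find((c) => c.type === 'text')?.text ?? '';
	if (result.isError) {
		throw new Error(text || 'XCrawl tool call failed');
	}
	return text;
}
```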

## Development

```bash
# Build and run the stdio server locally
npm install
npm run build
XCRAWL_API_KEY=your-key npm run start:stdio:dev

# Run the Workers dev server and hit its health endpoint
npm run dev
curl http://localhost:8787/health
```

## License

MIT

## Support

- XCrawl API: [xcrawl.com](https://xcrawl.com)
- MCP Server: Open an issue in this repository
package/claude.md ADDED
@@ -0,0 +1,295 @@
# xCrawl MCP Server - Claude Documentation

## Project Overview

**xCrawl MCP Server** is a Model Context Protocol (MCP) server that integrates the xCrawl web scraping API with AI models like Claude. It enables Claude to access web content, take screenshots, and extract structured data from any URL.

### Key Information

- **Name**: xcrawl-mcp
- **Version**: 1.0.0
- **Type**: TypeScript/Node.js project with dual deployment modes
- **Protocol**: MCP (Model Context Protocol) 2024-11-05
- **API Integration**: xCrawl API (https://run.xcrawl.com)

## Architecture

The project supports two deployment modes:

### 1. stdio Mode (Local)

- Runs as a local Node.js process
- Communicates via standard input/output (stdio transport)
- Used with Claude Desktop and similar MCP clients
- Entry point: `src/stdio.ts`
- Requires the `XCRAWL_API_KEY` environment variable

### 2. HTTP Mode (Cloudflare Workers)

- Deployed to the Cloudflare Workers edge network
- HTTP/JSON-RPC endpoint for remote access
- Entry point: `src/index.ts`
- API key provided via request headers (not environment variables)
- Custom domain: mcp.xcrawl.com

## Project Structure

```
xcrawl-mcp/
├── src/
│   ├── index.ts            # Cloudflare Workers entry (HTTP mode)
│   ├── stdio.ts            # stdio mode entry point
│   ├── types.ts            # TypeScript type definitions
│   ├── tools.ts            # Shared MCP tool definitions
│   └── core/
│       ├── crawl.ts        # Crawl tool logic
│       ├── map.ts          # Map tool logic
│       ├── scrape.ts       # Core scraping logic and API calls
│       └── search.ts       # Search tool logic
├── package.json            # Dependencies and scripts
├── tsconfig.json           # TypeScript configuration
├── tsconfig.build.json     # Build-specific TypeScript config
├── wrangler.jsonc          # Cloudflare Workers configuration
├── README.md               # User documentation
└── claude.md               # This file - Claude-specific documentation
```

## Core Components

### 1. MCP Tools (src/tools.ts)

The server provides the six MCP tools listed in README.md; the two scrape-related tools are detailed below:

#### `xcrawl_scrape`

Scrapes web pages with extensive configuration options.

**Key Features:**

- Multiple output formats: markdown, html, raw_html, links, summary, screenshot, json
- AI-powered structured data extraction
- Full browser JavaScript rendering
- Residential proxy support with location selection
- Sync/async execution modes
- Content filtering (ads, navigation removal)
- Full-page or viewport screenshots
- Custom headers and cookies

**Common Usage:**

```json
{
  "url": "https://example.com",
  "output": { "formats": ["markdown"] }
}
```

#### `xcrawl_check_status`

Checks the status and retrieves the results of async scrape tasks.

**Usage:**

```json
{
  "scrape_id": "job_abc123"
}
```

### 2. Core Scraping Logic (src/core/scrape.ts)

**Functions:**

- `callXCrawlAPI()` - Makes POST requests to the xCrawl API
- `checkScrapeStatus()` - Polls job status for async tasks
- `formatScrapeResponse()` - Formats API responses for MCP

**Key Details:**

- API endpoint: https://run.xcrawl.com/v1/scrape
- Status endpoint: https://run.xcrawl.com/v1/jobs/{scrape_id}/result
- Timeout: 300 seconds (5 minutes)
- Uses Zod for request validation
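
A rough sketch of what that validation might look like (field names mirror the documented request shape; the actual schema in `src/core/scrape.ts` may differ):

```typescript
import { z } from 'zod';

// Illustrative schema only; mirrors the documented request shape.
const ScrapeRequestSchema = z.object({
	url: z.string().url(),
	mode: z.enum(['sync', 'async']).default('sync'),
	output: z
		.object({
			formats: z
				.array(z.enum(['markdown', 'html', 'raw_html', 'links', 'summary', 'screenshot', 'json']))
				.default([]),
		})
		.optional(),
});

// safeParse returns a result object instead of throwing on bad input.
const parsed = ScrapeRequestSchema.safeParse({ url: 'https://example.com' });
if (!parsed.success) {
	console.error(parsed.error.flatten());
}
```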

### 3. Type Definitions (src/types.ts)

**Main Types:**

- `XCrawlScrapeRequest` - Request parameters for scraping
- `XCrawlScrapeResponse` - API response structure

### 4. Entry Points

#### stdio Mode (src/stdio.ts)

- Uses the `@modelcontextprotocol/sdk` `StdioServerTransport`
- Reads the API key from the `XCRAWL_API_KEY` environment variable
- Validates requests with Zod schemas
- Runs as a CLI tool
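With SDK 1.x, the wiring typically looks something like this (a minimal sketch, not the package's actual `src/stdio.ts`):

```typescript
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';

// Fail fast if the required API key is missing.
const apiKey = process.env.XCRAWL_API_KEY;
if (!apiKey) {
	console.error('XCRAWL_API_KEY environment variable is required');
	process.exit(1);
}

const server = new Server(
	{ name: 'xcrawl-mcp', version: '1.0.0' },
	{ capabilities: { tools: {} } },
);

// tools/list and tools/call request handlers would be registered here.

const transport = new StdioServerTransport();
await server.connect(transport);
```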

#### HTTP Mode (src/index.ts)

- Cloudflare Workers fetch handler
- Endpoints:
  - `/mcp` - Main MCP JSON-RPC endpoint
  - `/health` - Health check
  - `/sse` - SSE endpoint (not implemented)
- Extracts the API key from headers, in priority order:
  - `Authorization: Bearer <key>` (preferred)
  - `x-api-key: <key>`
  - `x-xcrawl-api-key: <key>`
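
That priority order could be implemented along these lines (an illustrative helper, not the actual source):

```typescript
// Illustrative sketch of the documented header priority.
function extractApiKey(request: Request): string | null {
	const auth = request.headers.get('Authorization');
	if (auth?.startsWith('Bearer ')) {
		return auth.slice('Bearer '.length);
	}
	return request.headers.get('x-api-key') ?? request.headers.get('x-xcrawl-api-key');
}
```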

## Authentication

### stdio Mode

- API key: `XCRAWL_API_KEY` environment variable
- Configured in MCP client settings

### Cloudflare Workers

- API key: request headers only
- No environment variables or secrets required
- Each request provides its own API key

## Deployment

### Local Development (stdio)

```bash
npm install
npm run build
XCRAWL_API_KEY=your-key npm run start:stdio:dev
```

### Cloudflare Workers

```bash
npm install
npm run deploy
```

**Deployment Configuration (wrangler.jsonc):**

- Account ID: e579d774006254eebf5dd7dcb44bacd1
- Custom domain: mcp.xcrawl.com
- Compatibility date: 2025-12-24
- Minification: enabled
- Observability: traces and logs enabled

## Key Features

### Output Formats

- `html` - Cleaned HTML (no scripts, styles)
- `raw_html` - Original HTML with all tags
- `markdown` - Markdown conversion
- `links` - Extracted URLs
- `summary` - AI-generated summary
- `screenshot` - Image URL (full_page or viewport)
- `json` - AI-extracted structured data

### AI-Powered Features

- **Structured Data Extraction**: Use natural language prompts to extract data
- **Smart Summarization**: AI-generated page summaries
- **Schema Validation**: Optional JSON Schema for strict output

### Performance Features

- 300-second timeout for complex pages
- Async mode for long-running tasks
- Proxy support for geo-restricted content
- JavaScript rendering with configurable wait conditions

## Response Structure

### Successful Scrape Response

```json
{
  "scrape_id": "uuid",
  "endpoint": "scrape",
  "version": "string",
  "status": "success",
  "url": "https://example.com",
  "data": {
    "markdown": "# Content...",
    "html": "cleaned HTML...",
    "raw_html": "<!DOCTYPE html>...",
    "links": ["https://..."],
    "screenshot": "https://storage.url/screenshot.png",
    "summary": "AI summary...",
    "json": { "extracted": "data" },
    "metadata": {
      "title": "Page Title",
      "status_code": 200,
      "proxy_location": "US",
      "proxy_sticky_session": "session-id"
    }
  },
  "startedAt": "2025-12-24T00:00:00Z",
  "endedAt": "2025-12-24T00:00:05Z"
}
```

### Async Task Response

```json
{
  "scrape_id": "uuid",
  "status": "pending",
  "message": "Task queued for processing"
}
```
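
The authoritative definitions live in `src/types.ts`; a rough TypeScript shape inferred from the two examples above:

```typescript
// Shape inferred from the documented examples; src/types.ts is authoritative.
interface ScrapeResponseSketch {
	scrape_id: string;
	status: 'success' | 'pending' | 'crawling' | 'completed' | 'failed';
	url?: string;
	message?: string;
	data?: {
		markdown?: string;
		html?: string;
		raw_html?: string;
		links?: string[];
		screenshot?: string;
		summary?: string;
		json?: unknown;
		metadata?: { title?: string; status_code?: number; proxy_location?: string };
	};
	startedAt?: string;
	endedAt?: string;
}
```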

## Dependencies

### Production

- `@modelcontextprotocol/sdk` ^1.0.4 - MCP protocol implementation
- `zod` ^3.23.8 - Schema validation

### Development

- `@cloudflare/vitest-pool-workers` - Workers testing
- `@cloudflare/workers-types` - TypeScript types for Workers
- `typescript` - TypeScript compiler
- `tsx` - TypeScript execution for development
- `wrangler` - Cloudflare Workers CLI
- `vitest` - Testing framework

## Scripts

```jsonc
{
  "deploy": "wrangler deploy",           // Deploy to Cloudflare Workers
  "dev": "wrangler dev",                 // Local Workers dev server
  "build": "tsc -p tsconfig.build.json", // Build TypeScript
  "start:stdio": "node dist/stdio.js",   // Run built stdio mode
  "start:stdio:dev": "tsx src/stdio.ts", // Run stdio mode in dev
  "test": "vitest",                      // Run tests
  "cf-typegen": "wrangler types"         // Generate Workers types
}
```

## Configuration Files

### package.json

- Main entry: `dist/stdio.js`
- Binary: `xcrawl-mcp`
- Type: ESM module

### tsconfig.json

- Target: ES2022
- Module: ESNext
- Module resolution: bundler
- Strict mode enabled

### wrangler.jsonc

- Deployment target: Cloudflare Workers
- Custom domain routing
- Observability enabled
- Minification enabled

## Important Notes

### For Claude/AI Models Using This Server

1. **Default Mode**: Always use sync mode unless the user explicitly requests async
2. **Screenshot Display**: Always show the full URL when a screenshot exists in the response
3. **JSON Data**: Display the complete raw JSON data; never summarize unless asked
4. **Async Workflow** (sketched below):
   - Call `xcrawl_scrape` with `"mode": "async"`
   - Extract `scrape_id` from the top-level response (not from `response.data`)
   - Wait 10-15 seconds
   - Call `xcrawl_check_status` with the `scrape_id`
5. **Error Handling**: All errors are returned in MCP tool response format with `isError: true`
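
A client-side sketch of that workflow (`callTool` stands in for however your MCP client invokes tools):

```typescript
// Illustrative async-scrape polling loop following the documented workflow.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<any>;

async function scrapeAsync(callTool: CallTool, url: string) {
	const started = await callTool('xcrawl_scrape', { url, mode: 'async' });
	const scrapeId = started.scrape_id; // top-level field, not started.data

	for (;;) {
		// Wait 10-15 seconds between status checks, per the notes above.
		await new Promise((resolve) => setTimeout(resolve, 12_000));
		const status = await callTool('xcrawl_check_status', { scrape_id: scrapeId });
		if (status.status === 'completed' || status.status === 'failed') {
			return status;
		}
	}
}
```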

### Security Considerations

- API keys handled securely (env vars for stdio, headers for HTTP)
- TLS verification can be skipped for internal sites
- No API keys stored in the Workers environment
- Each request authenticates independently

## Git Status

- Current branch: master
- Modified files: wrangler.jsonc
- Recent changes focus on tool descriptions and timeout adjustments

## Related Resources

- xCrawl API: https://run.xcrawl.com
- MCP Protocol: https://modelcontextprotocol.io
- Cloudflare Workers: https://workers.cloudflare.com