crawlforge-mcp-server 4.2.1 → 4.2.3

package/CLAUDE.md CHANGED
@@ -60,9 +60,9 @@ These guidelines are working if: fewer unnecessary changes in diffs, fewer rewri
 
  ## Project Overview
 
- CrawlForge MCP Server - A professional MCP (Model Context Protocol) server providing 20 web scraping, crawling, and content processing tools.
+ CrawlForge MCP Server - A professional MCP (Model Context Protocol) server providing 23 web scraping, crawling, and content processing tools (5 inline + 18 advanced).
 
- **Current Version:** 3.0.12
+ **Current Version:** 4.2.2
 
  ## Development Commands
 
@@ -92,6 +92,12 @@ npm run dev
  # Test MCP protocol compliance
  npm test
 
+ # Unit tests (131 tests across 17 tools, no live network)
+ npm run test:unit
+
+ # Integration tests
+ npm run test:integration
+
  # Functional tests
  node test-tools.js # Test all tools
  node test-real-world.js # Test real-world usage scenarios
@@ -99,6 +105,13 @@ node test-real-world.js # Test real-world usage scenarios
  # MCP Protocol tests
  node tests/integration/mcp-protocol-compliance.test.js
 
+ # CLI (v4.1.0+, requires global install or npx)
+ crawlforge --help # Show all 15 subcommands
+ crawlforge scrape https://example.com
+ crawlforge batch --urls urls.txt --format markdown
+ crawlforge install-skills --target claude-code
+ # See docs/cli-guide.md for full reference
+
  # Docker
  npm run docker:build # Build Docker image
  npm run docker:dev # Run development container
@@ -124,30 +137,37 @@ npm run docker:prod # Run production container
  - **WebhookDispatcher**: Event notification system for job completion callbacks
  - **ActionExecutor**: Browser automation engine (Playwright-based)
  - **ResearchOrchestrator**: Multi-stage research with query expansion and synthesis
- - **StealthBrowserManager**: Stealth mode scraping with anti-detection
+ - **StealthBrowserManager**: Stealth mode scraping with anti-detection; Camoufox (Firefox) engine added in v4.0.0
  - **LocalizationManager**: Multi-language content and localization
  - **ChangeTracker**: Content change tracking over time
  - **SnapshotManager**: Website snapshots and version history
+ - **ResourceRegistry**: MCP Resources (crawlforge:// URI scheme, 5 resource types) — D1.1, v3.6.0
+ - **PromptRegistry** (`src/prompts/`): 5 workflow prompts — D1.2, v3.6.0
+ - **SamplingClient**: MCP Sampling with Ollama-API fallback chain — D1.3, v3.6.0
+ - **ElicitationHelper**: MCP Elicitation for user confirmation on expensive operations — D1.4, v3.6.0
+ - **endpointGuard**: Allow-list guard for server's own backend calls — v3.0.18
 
  ### Tool Layer (`src/tools/`)
 
  Tools are organized in subdirectories by category:
 
  - `advanced/` - BatchScrapeTool, ScrapeWithActionsTool
+ - `basic/` - fetchUrl, extractText, extractLinks, extractMetadata, scrapeStructured
  - `crawl/` - crawlDeep, mapSite
- - `extract/` - analyzeContent, extractContent, processDocument, summarizeContent
+ - `extract/` - analyzeContent, extractContent, extractStructured, extractWithLlm, listOllamaModels, processDocument, summarizeContent
  - `research/` - deepResearch
  - `search/` - searchWeb (proxied through CrawlForge.dev API)
+ - `templates/` - ScrapeTemplateTool (10 pre-built site templates, v4.0.0)
  - `tracking/` - trackChanges
  - `llmstxt/` - generateLLMsTxt
 
- ### Available MCP Tools (20 total)
+ ### Available MCP Tools (23 total)
 
- **Basic Tools (server.js inline):**
+ **Basic Tools (server.js inline, 5):**
  fetch_url, extract_text, extract_links, extract_metadata, scrape_structured
 
- **Advanced Tools:**
- search_web, crawl_deep, map_site, extract_content, process_document, summarize_content, analyze_content, extract_structured, batch_scrape, scrape_with_actions, deep_research, track_changes, generate_llms_txt, stealth_mode, localization
+ **Advanced Tools (18):**
+ search_web, crawl_deep, map_site, extract_content, process_document, summarize_content, analyze_content, extract_structured, extract_with_llm, list_ollama_models, batch_scrape, scrape_with_actions, deep_research, track_changes, generate_llms_txt, stealth_mode, localization, scrape_template
 
  ### MCP Server Entry Point
 
@@ -202,10 +222,23 @@ When adding a new tool to server.js:
  5. Add to cleanup array in gracefulShutdown if it has `destroy()` or `cleanup()` methods
  6. Update tool count in console log at server startup
 
+ ## Sandboxing & Approvals
+
+ Key mechanisms for security-conscious future sessions:
+
+ - **SSRF** (`src/utils/ssrfProtection.js`): Every scraped URL validated — http/https only; blocks loopback, RFC1918, IPv6 ULA/link-local, cloud metadata endpoints; blocks dangerous ports (22, 25, 53, 445, 3306, 5432, 6379, 27017, etc.); redirects re-validated per hop, capped at 5; pre-parse path-traversal rejection. Blocklist-based — no per-deployment outbound allowlist.
+ - **endpointGuard** (`src/core/endpointGuard.js`): Hard allow-list of `{crawlforge.dev, www.crawlforge.dev, api.crawlforge.dev}` for the server's own backend calls; HTTPS required; fail-closed. Localhost only in creator mode (v3.0.18).
+ - **Action allowlist** (`src/core/ActionExecutor.js`): `scrape_with_actions` accepts only 7 action types: `wait`, `click`, `type`, `press`, `scroll`, `screenshot`, `executeJavaScript`. `executeJavaScript` throws unless `ALLOW_JAVASCRIPT_EXECUTION=true` is set at deploy time (off by default).
+ - **Elicitation** (`src/core/ElicitationHelper.js`): User confirmation requested for `deep_research` (>50 URLs), `batch_scrape` (sync, >25 URLs), `crawl_deep` (projected >500 pages), `extract_structured` (schema has >3 required fields, no LLM configured), and credit-low situations. Fail-open if client does not support elicitation.
+ - **Browser sandboxing**: Standard pool retains OS sandbox. Stealth Chromium uses `--no-sandbox` + `--disable-web-security` (deliberate fingerprint-spoofing trade-off). Camoufox (Firefox, v4.0.0) is the alternative — see `docs/stealth-engines.md`.
+
+ See `docs/sandboxing-and-approvals.md` for the full reference.
+
  ## Security
 
  Security testing and CI/CD pipeline details are in:
 
+ - `docs/sandboxing-and-approvals.md` — Canonical sandboxing & approvals reference
  - `docs/security-audit-report.md` — Full security audit
  - `.github/workflows/ci.yml` — CI pipeline with security checks
  - `.github/workflows/security.yml` — Daily scheduled security scanning
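
Reviewer note: the SSRF bullet added to CLAUDE.md above describes a blocklist check (scheme, private or loopback host, dangerous port). The following is a minimal sketch of that kind of check, not the contents of `src/utils/ssrfProtection.js`, which this diff does not show. The names `checkUrl`, `isBlockedIPv4`, and `BLOCKED_PORTS` are hypothetical, only the ports named in the bullet are listed, and IPv6 ULA ranges, DNS resolution, and redirect re-validation are deliberately left out.

```javascript
// Hypothetical sketch of a blocklist-style URL check; names are illustrative only.
const BLOCKED_PORTS = new Set([22, 25, 53, 445, 3306, 5432, 6379, 27017]);

// True when `host` is an IPv4 literal in a loopback, RFC1918, or link-local range.
function isBlockedIPv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false; // not an IPv4 literal
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 127 ||                          // loopback
    a === 10 ||                           // RFC1918 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) ||  // RFC1918 172.16.0.0/12
    (a === 192 && b === 168) ||           // RFC1918 192.168.0.0/16
    (a === 169 && b === 254)              // link-local, covers 169.254.169.254 metadata
  );
}

// Returns { ok: true } or { ok: false, reason } for a candidate URL string.
function checkUrl(raw) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return { ok: false, reason: "unparseable" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "scheme" };
  }
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host === "[::1]" || isBlockedIPv4(host)) {
    return { ok: false, reason: "host" };
  }
  // url.port is "" when the scheme's default port is used.
  if (url.port !== "" && BLOCKED_PORTS.has(Number(url.port))) {
    return { ok: false, reason: "port" };
  }
  return { ok: true };
}
```

A literal-only check like this does not catch hostnames that resolve to private addresses, so a real implementation would presumably also validate the resolved IP and repeat the check on each redirect hop, as the bullet states.
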
package/README.md CHANGED
@@ -9,7 +9,7 @@ Professional web scraping and content extraction server implementing the Model C
 
  ## 🎯 Features
 
- - **22 Professional Tools**: Web scraping, deep research, stealth browsing, content analysis, local-LLM extraction (Ollama)
+ - **23 Professional Tools**: Web scraping, deep research, stealth browsing, content analysis, local-LLM extraction (Ollama)
  - **Free Tier**: 1,000 credits to get started instantly
  - **MCP Compatible**: Works with Claude, Cursor, and other MCP-enabled AI tools
  - **Enterprise Ready**: Scale up with paid plans for production use
@@ -140,7 +140,7 @@ Restart Cursor to activate.
  | **Enterprise** | 250,000 | Large scale operations |
 
  **All plans include:**
- - Access to all 22 tools
+ - Access to all 23 tools
  - Credits never expire and roll over month-to-month
  - API access and webhook notifications
 
@@ -216,6 +216,17 @@ Once configured, use these tools in your AI assistant:
  - **Rate Limiting**: Built-in protection against abuse
  - **Compliance**: Respects robots.txt and GDPR requirements
 
+ ### Security & Approvals
+
+ - **SSRF enforcement**: Every scraped URL is validated before the request is sent — http/https only; blocks loopback, RFC1918, IPv6 private/link-local ranges, cloud metadata endpoints (GCP, Azure), and dangerous ports (SSH, SMTP, DNS, MySQL, Postgres, Redis, MongoDB, etc.). Redirects are re-validated each hop, capped at 5.
+ - **Backend endpoint guard** (v3.0.18): The server's own calls to CrawlForge.dev use a separate fail-closed allow-list (`{crawlforge.dev, www.crawlforge.dev, api.crawlforge.dev}`, HTTPS required). Setting `CRAWLFORGE_API_URL` to an arbitrary host is blocked at parse time.
+ - **Action allowlist**: `scrape_with_actions` accepts only 7 action types (`wait`, `click`, `type`, `press`, `scroll`, `screenshot`, `executeJavaScript`). No download, file-write, or arbitrary cross-page navigation primitives exist.
+ - **JavaScript gate**: The `executeJavaScript` action throws by default. Set `ALLOW_JAVASCRIPT_EXECUTION=true` at deploy time to enable (not recommended in production).
+ - **MCP Elicitation** (v3.6.0): Four tools request user confirmation before executing expensive operations — `deep_research` (>50 URLs), `batch_scrape` (sync mode, >25 URLs), `crawl_deep` (projected >500 pages), `extract_structured` (schema has >3 required fields with no LLM configured). Credit-low situations also elicit. Confirmation is best-effort: if the MCP client does not support elicitation the tool proceeds (fail-open).
+ - **Per-tool credit gating**: Every tool is wrapped with `withAuth()`, which checks and deducts credits before execution. Fail-closed since v3.0.18.
+
+ See [docs/sandboxing-and-approvals.md](docs/sandboxing-and-approvals.md) for the full reference.
+
  ### Security Updates
 
  **v3.0.3 (2025-10-01)**: Removed authentication bypass vulnerability. All users must authenticate with valid API keys.
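
Reviewer note: the "Action allowlist" and "JavaScript gate" bullets added to the README above can be illustrated with a short sketch. This is an assumption-laden illustration, not the code in `src/core/ActionExecutor.js`: the function name `validateActions`, the `type` field on each action object, and the error messages are hypothetical. Only the seven action names and the `ALLOW_JAVASCRIPT_EXECUTION=true` switch come from the diff.

```javascript
// Hypothetical sketch; the seven names below are the ones listed in the README hunk.
const ALLOWED_ACTIONS = new Set([
  "wait", "click", "type", "press", "scroll", "screenshot", "executeJavaScript",
]);

// Throws on the first action that is not allow-listed, or on executeJavaScript
// when the environment switch is not exactly the string "true".
// `env` defaults to process.env and is injectable so the gate can be tested.
function validateActions(actions, env = process.env) {
  for (const action of actions) {
    if (!ALLOWED_ACTIONS.has(action.type)) {
      throw new Error(`Unsupported action type: ${action.type}`);
    }
    if (
      action.type === "executeJavaScript" &&
      env.ALLOW_JAVASCRIPT_EXECUTION !== "true"
    ) {
      throw new Error(
        "executeJavaScript is disabled; set ALLOW_JAVASCRIPT_EXECUTION=true to enable"
      );
    }
  }
  return actions;
}
```

Validating the whole list before running any step means a disallowed action late in the sequence cannot leave a page half-automated; whether the real executor validates up front or per step is not visible in this diff.
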
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "crawlforge-mcp-server",
- "version": "4.2.1",
+ "version": "4.2.3",
  "description": "CrawlForge MCP Server - Professional Model Context Protocol server with 23 web scraping, crawling, and content processing tools. Defaults to local Ollama for LLM extraction (no API key needed); OpenAI/Anthropic available as opt-in. v4.0 adds Markdown-first output, pre-built site templates, Camoufox stealth engine, and cost transparency.",
  "main": "server.js",
  "bin": {
@@ -19,7 +19,7 @@
  "test:tools": "node test-tools.js",
  "test:real-world": "node test-real-world.js",
  "test:all": "bash run-all-tests.sh",
- "postinstall": "echo '\n\ud83c\udf89 CrawlForge MCP Server installed!\n\nRun \"npx crawlforge-setup\" to configure your API key and get started.\n'",
+ "postinstall": "echo '\n🎉 CrawlForge MCP Server installed!\n\nRun \"npx crawlforge-setup\" to configure your API key and get started.\n'",
  "docker:build": "docker build -t crawlforge .",
  "docker:dev": "docker-compose up crawlforge-dev",
  "docker:prod": "docker-compose up crawlforge-prod"
package/server.js CHANGED
@@ -95,7 +95,7 @@ if (configErrors.length > 0 && config.server.nodeEnv === 'production') {
  // Create the server
  const server = new McpServer({
  name: "crawlforge",
- version: "4.2.1",
+ version: "4.2.2",
  description: "Production-ready MCP server with 23 web scraping, crawling, and content processing tools. Features MCP Resources (crawlforge://), Prompts, Sampling fallback, Elicitation, stealth browsing, deep research, structured extraction, change tracking, and local-LLM extraction via Ollama.",
  homepage: "https://www.crawlforge.dev",
  icon: "https://www.crawlforge.dev/icon.png"
@@ -110,7 +110,7 @@ server.prompt("getting-started", {
  role: "user",
  content: {
  type: "text",
- text: "You have access to CrawlForge MCP with 22 web scraping tools. Key tools:\n\n" +
+ text: "You have access to CrawlForge MCP with 23 web scraping tools. Key tools:\n\n" +
  "- fetch_url: Fetch raw HTML/content from any URL\n" +
  "- extract_text: Extract clean text from a webpage\n" +
  "- extract_content: Smart content extraction with readability\n" +