crawlforge-mcp-server 3.0.11 → 3.0.13

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CLAUDE.md CHANGED
@@ -2,12 +2,67 @@
 
  This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
+ Behavioral guidelines to reduce common LLM coding mistakes. Merge with project-specific instructions as needed.
+
+ Tradeoff: These guidelines bias toward caution over speed. For trivial tasks, use judgment.
+
+ 1. Think Before Coding
+ Don't assume. Don't hide confusion. Surface tradeoffs.
+
+ Before implementing:
+
+ State your assumptions explicitly. If uncertain, ask.
+ If multiple interpretations exist, present them - don't pick silently.
+ If a simpler approach exists, say so. Push back when warranted.
+ If something is unclear, stop. Name what's confusing. Ask.
+
+ 2. Simplicity First
+ Minimum code that solves the problem. Nothing speculative.
+
+ No features beyond what was asked.
+ No abstractions for single-use code.
+ No "flexibility" or "configurability" that wasn't requested.
+ No error handling for impossible scenarios.
+ If you write 200 lines and it could be 50, rewrite it.
+ Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.
+
+ 3. Surgical Changes
+ Touch only what you must. Clean up only your own mess.
+
+ When editing existing code:
+
+ Don't "improve" adjacent code, comments, or formatting.
+ Don't refactor things that aren't broken.
+ Match existing style, even if you'd do it differently.
+ If you notice unrelated dead code, mention it - don't delete it.
+ When your changes create orphans:
+
+ Remove imports/variables/functions that YOUR changes made unused.
+ Don't remove pre-existing dead code unless asked.
+ The test: Every changed line should trace directly to the user's request.
+
+ 4. Goal-Driven Execution
+ Define success criteria. Loop until verified.
+
+ Transform tasks into verifiable goals:
+
+ "Add validation" → "Write tests for invalid inputs, then make them pass"
+ "Fix the bug" → "Write a test that reproduces it, then make it pass"
+ "Refactor X" → "Ensure tests pass before and after"
+ For multi-step tasks, state a brief plan:
+
+ 1. [Step] → verify: [check]
+ 2. [Step] → verify: [check]
+ 3. [Step] → verify: [check]
+ Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.
+
+ These guidelines are working if: fewer unnecessary changes in diffs, fewer rewrites due to overcomplication, and clarifying questions come before implementation rather than after mistakes.
+
  ## Project Overview
 
- CrawlForge MCP Server - A professional MCP (Model Context Protocol) server implementation providing 19 comprehensive web scraping, crawling, and content processing tools. Version 3.0.3 includes advanced content extraction, document processing, summarization, and analysis capabilities. Wave 2 adds asynchronous batch processing and browser automation features. Wave 3 introduces deep research orchestration, stealth scraping, localization, and change tracking.
+ CrawlForge MCP Server - A professional MCP (Model Context Protocol) server providing 19 web scraping, crawling, and content processing tools.
 
- **Current Version:** 3.0.3
- **Security Status:** Secure (authentication bypass vulnerability fixed in v3.0.3)
+ **Current Version:** 3.0.12
 
  ## Development Commands
 
@@ -28,6 +83,9 @@ export CRAWLFORGE_API_KEY="your_api_key_here"
  # Run the server (production)
  npm start
 
+ # HTTP transport mode
+ npm run start:http
+
  # Development mode with verbose logging
  npm run dev
 
@@ -35,39 +93,27 @@ npm run dev
  npm test
 
  # Functional tests
- node test-tools.js # Test all tools (basic, Wave 2, Wave 3)
+ node test-tools.js # Test all tools
  node test-real-world.js # Test real-world usage scenarios
 
  # MCP Protocol tests
- node tests/integration/mcp-protocol-compliance.test.js # MCP protocol compliance
+ node tests/integration/mcp-protocol-compliance.test.js
 
- # Docker commands
+ # Docker
  npm run docker:build # Build Docker image
  npm run docker:dev # Run development container
  npm run docker:prod # Run production container
- npm run docker:test # Run test container
- npm run docker:perf # Run performance test container
-
- # Security Testing (CI/CD Integration)
- npm run test:security # Run comprehensive security test suite
- npm audit # Check for dependency vulnerabilities
- npm audit fix # Automatically fix vulnerabilities
- npm outdated # Check for outdated packages
-
- # Release management
- npm run release:patch # Patch version bump
- npm run release:minor # Minor version bump
- npm run release:major # Major version bump
-
- # Cleanup
- npm run clean # Remove cache, logs, test results
-
- # Running specific test files
- node tests/integration/mcp-protocol-compliance.test.js # MCP protocol compliance
- node test-tools.js # All tools functional test
- node test-real-world.js # Real-world scenarios test
  ```
 
+ ### Debugging Tips
+
+ - Server logs via Winston logger (stderr for status, stdout for MCP protocol)
+ - Set `NODE_ENV=development` for verbose logging
+ - Use `--expose-gc` flag for memory profiling: `node --expose-gc server.js`
+ - Check `cache/` directory for cached responses
+ - Review `logs/` directory for application logs
+ - Memory monitoring auto-enabled in development mode (logs every 60s if >200MB)
+
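The last debugging tip can be made concrete. This is a hypothetical sketch of a development-mode memory check (the helper name and log format are assumptions, not the actual server.js code; the 200MB threshold comes from the tip above):

```javascript
// Hypothetical sketch of the development-mode memory monitor described above.
// Names and log format are illustrative only.
const MEMORY_THRESHOLD_BYTES = 200 * 1024 * 1024; // 200MB, per the tip above

function checkMemory(threshold = MEMORY_THRESHOLD_BYTES) {
  const { heapUsed } = process.memoryUsage();
  const over = heapUsed > threshold;
  if (over) {
    // Status goes to stderr so stdout stays reserved for the MCP protocol
    console.error(`[memory] heap used: ${(heapUsed / 1024 / 1024).toFixed(1)} MB`);
  }
  return over;
}

// In development mode this would run on an interval, e.g.:
// setInterval(checkMemory, 60_000);
```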
  ## High-Level Architecture
 
  ### Core Infrastructure (`src/core/`)
@@ -76,12 +122,12 @@ node test-real-world.js # Real-world scenarios t
  - **PerformanceManager**: Centralized performance monitoring and optimization
  - **JobManager**: Asynchronous job tracking and management for batch operations
  - **WebhookDispatcher**: Event notification system for job completion callbacks
- - **ActionExecutor**: Browser automation engine for complex interactions (Playwright-based)
- - **ResearchOrchestrator**: Coordinates multi-stage research with query expansion and synthesis
- - **StealthBrowserManager**: Manages stealth mode scraping with anti-detection features
- - **LocalizationManager**: Handles multi-language content and localization
- - **ChangeTracker**: Tracks and compares content changes over time
- - **SnapshotManager**: Manages website snapshots and version history
+ - **ActionExecutor**: Browser automation engine (Playwright-based)
+ - **ResearchOrchestrator**: Multi-stage research with query expansion and synthesis
+ - **StealthBrowserManager**: Stealth mode scraping with anti-detection
+ - **LocalizationManager**: Multi-language content and localization
+ - **ChangeTracker**: Content change tracking over time
+ - **SnapshotManager**: Website snapshots and version history
 
  ### Tool Layer (`src/tools/`)
 
@@ -91,45 +137,26 @@ Tools are organized in subdirectories by category:
  - `crawl/` - crawlDeep, mapSite
  - `extract/` - analyzeContent, extractContent, processDocument, summarizeContent
  - `research/` - deepResearch
- - `search/` - searchWeb (uses CrawlForge proxy for Google Search)
+ - `search/` - searchWeb (proxied through CrawlForge.dev API)
  - `tracking/` - trackChanges
  - `llmstxt/` - generateLLMsTxt
 
  ### Available MCP Tools (19 total)
 
  **Basic Tools (server.js inline):**
-
- - fetch_url, extract_text, extract_links, extract_metadata, scrape_structured
+ fetch_url, extract_text, extract_links, extract_metadata, scrape_structured
 
  **Advanced Tools:**
-
- - search_web, crawl_deep, map_site
- - extract_content, process_document, summarize_content, analyze_content
- - batch_scrape, scrape_with_actions, deep_research
- - track_changes, generate_llms_txt, stealth_mode, localization
+ search_web, crawl_deep, map_site, extract_content, process_document, summarize_content, analyze_content, batch_scrape, scrape_with_actions, deep_research, track_changes, generate_llms_txt, stealth_mode, localization
 
  ### MCP Server Entry Point
 
  The main server implementation is in `server.js` which:
 
- 1. **Secure Creator Mode** (server.js lines 3-25):
- - Loads `.env` file early to check for `CRAWLFORGE_CREATOR_SECRET`
- - Validates secret using SHA256 hash comparison
- - Only creator with valid secret UUID can enable unlimited access
- - Hash stored in code is safe to commit (one-way cryptographic hash)
-
- 2. **Authentication Flow**: Uses AuthManager for API key validation and credit tracking
- - Checks for authentication on startup
- - Auto-setup if CRAWLFORGE_API_KEY environment variable is present
- - Creator mode bypasses credit checks for development/testing
-
- 3. **Tool Registration**: All tools registered via `server.registerTool()` pattern
- - Wrapped with `withAuth()` function for credit tracking and authentication
- - Each tool has inline Zod schema for parameter validation
- - Response format uses `content` array with text objects
-
- 4. **Transport**: Uses stdio transport for MCP protocol communication
-
+ 1. **Secure Creator Mode**: Loads `.env` early, validates secret via SHA256 hash comparison
+ 2. **Authentication Flow**: AuthManager for API key validation and credit tracking
+ 3. **Tool Registration**: All tools registered via `server.registerTool()`, wrapped with `withAuth()` for credit tracking
+ 4. **Transport**: stdio transport for MCP protocol communication
  5. **Graceful Shutdown**: Cleans up browser instances, job managers, and other resources
 
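The SHA256 comparison in step 1 presumably follows the standard Node.js pattern; a sketch under that assumption (the stored hash here is the well-known digest of the string "test", used purely as an example, not the real value from server.js):

```javascript
// Illustrative sketch of secret validation by SHA256 hash comparison.
// STORED_HASH is sha256("test"), an example value only.
import { createHash, timingSafeEqual } from 'node:crypto';

const STORED_HASH = '9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08';

function isValidSecret(candidate) {
  const candidateHash = createHash('sha256').update(candidate).digest('hex');
  // timingSafeEqual avoids leaking match position via timing; it requires
  // equal-length inputs, which two hex digests always are.
  return timingSafeEqual(Buffer.from(candidateHash), Buffer.from(STORED_HASH));
}
```

Because the hash is one-way, committing it to the repository does not reveal the secret itself.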
  ### Tool Credit System
@@ -139,7 +166,6 @@ Each tool wrapped with `withAuth(toolName, handler)`:
  - Checks credits before execution (skipped in creator mode)
  - Reports usage with credit deduction on success
  - Charges half credits on error
- Returns credit error if insufficient balance
  - Creator mode: Unlimited access for package maintainer
 
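A minimal sketch of a wrapper with the semantics listed above (hypothetical: the real `withAuth` in server.js works through AuthManager and the CrawlForge API rather than a local ledger object):

```javascript
// Hypothetical credit-tracking wrapper: full deduction on success, half on
// error, creator-mode bypass. The `ledger` object stands in for AuthManager.
function withAuth(toolName, handler, { credits = 1, creatorMode = false, ledger }) {
  return async function wrapped(params) {
    if (!creatorMode && ledger.balance < credits) {
      throw new Error(`Insufficient credits for ${toolName}`);
    }
    try {
      const result = await handler(params);
      if (!creatorMode) ledger.balance -= credits;     // full cost on success
      return result;
    } catch (err) {
      if (!creatorMode) ledger.balance -= credits / 2; // half cost on error
      throw err;
    }
  };
}
```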
  ### Key Configuration
@@ -147,21 +173,11 @@ Each tool wrapped with `withAuth(toolName, handler)`:
  Critical environment variables defined in `src/constants/config.js`:
 
  ```bash
- # Authentication (required for users)
  CRAWLFORGE_API_KEY=your_api_key_here
-
- # Creator Mode (maintainer only - KEEP SECRET!)
- # CRAWLFORGE_CREATOR_SECRET=your-uuid-secret
- # Enables unlimited access for development/testing
-
-
- # Performance Settings
  MAX_WORKERS=10
  QUEUE_CONCURRENCY=10
  CACHE_TTL=3600000
  RATE_LIMIT_REQUESTS_PER_SECOND=10
-
- # Crawling Limits
  MAX_CRAWL_DEPTH=5
  MAX_PAGES_PER_CRAWL=100
  RESPECT_ROBOTS_TXT=true
@@ -173,44 +189,6 @@ RESPECT_ROBOTS_TXT=true
  - `.env` - Environment variables for development
  - `src/constants/config.js` - Central configuration with defaults and validation
 
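Given that `src/constants/config.js` centralizes defaults and validation, parsing plausibly looks like this sketch (the key names come from the list above; the defaults and object shape are assumptions, not the actual implementation):

```javascript
// Sketch of env parsing with fallbacks; defaults are assumptions taken from
// the example values above, not copied from src/constants/config.js.
function loadConfig(env = process.env) {
  return {
    maxWorkers: parseInt(env.MAX_WORKERS ?? '10', 10),
    queueConcurrency: parseInt(env.QUEUE_CONCURRENCY ?? '10', 10),
    cacheTtl: parseInt(env.CACHE_TTL ?? '3600000', 10),
    maxCrawlDepth: parseInt(env.MAX_CRAWL_DEPTH ?? '5', 10),
    respectRobotsTxt: (env.RESPECT_ROBOTS_TXT ?? 'true') !== 'false',
  };
}
```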
- ## Common Development Tasks
-
- ### Running a Single Test
-
- ```bash
- # Run a specific test file
- node tests/unit/linkAnalyzer.test.js
-
- # Run a specific Wave test
- node tests/validation/test-batch-scrape.js
-
- # Run Wave 3 tests with verbose output
- npm run test:wave3:verbose
- ```
-
- ### Testing Tool Integration
-
- ```bash
- # Test MCP protocol compliance
- npm test
-
- # Test specific tool functionality
- node tests/validation/test-batch-scrape.js
- node tests/validation/test-scrape-with-actions.js
-
- # Test research features
- node tests/validation/wave3-validation.js
- ```
-
- ### Debugging Tips
-
- - Server logs are written to console via Winston logger (stderr for status, stdout for MCP protocol)
- - Set `NODE_ENV=development` for verbose logging
- - Use `--expose-gc` flag for memory profiling: `node --expose-gc server.js`
- - Check `cache/` directory for cached responses
- - Review `logs/` directory for application logs
- - Memory monitoring automatically enabled in development mode (logs every 60s if >200MB)
-
  ### Adding New Tools
 
  When adding a new tool to server.js:
@@ -220,225 +198,35 @@ When adding a new tool to server.js:
  3. Register with `server.registerTool(name, { description, inputSchema }, withAuth(name, handler))`
  4. Ensure tool implements `execute(params)` method
  5. Add to cleanup array in gracefulShutdown if it has `destroy()` or `cleanup()` methods
- 6. Update tool count in console log at server startup (line 1860)
-
- ## CI/CD Security Integration
-
- ### Automated Security Testing Pipeline
-
- The project includes comprehensive security testing integrated into the CI/CD pipeline:
-
- #### Main CI Pipeline (`.github/workflows/ci.yml`)
-
- The CI pipeline runs on every PR and push to main/develop branches and includes:
-
- **Security Test Suite:**
-
- - SSRF Protection validation
- - Input validation (XSS, SQL injection, command injection)
- - Rate limiting functionality
- - DoS protection measures
- - Regex DoS vulnerability detection
+ 6. Update tool count in console log at server startup
 
- **Dependency Security:**
+ ## Security
 
- - npm audit with JSON output and summary generation
- - Vulnerability severity analysis (critical/high/moderate/low)
- - License compliance checking
- - Outdated package detection
+ Security testing and CI/CD pipeline details are in:
 
- **Static Code Analysis:**
+ - `docs/security-audit-report.md` — Full security audit
+ - `.github/workflows/ci.yml` — CI pipeline with security checks
+ - `.github/workflows/security.yml` — Daily scheduled security scanning
+ - `.github/SECURITY.md` — Security policy and procedures
 
- - CodeQL security analysis with extended queries
- - ESLint security rules for dangerous patterns
- - Hardcoded secret detection
- - Security file scanning
+ Run `npm audit` locally to check dependencies.
 
- **Reporting & Artifacts:**
-
- - Comprehensive security reports generated
- - PR comments with security summaries
- - Artifact upload for detailed analysis
- - Build failure on critical vulnerabilities
-
- #### Dedicated Security Workflow (`.github/workflows/security.yml`)
-
- Daily scheduled comprehensive security scanning:
-
- **Dependency Security Scan:**
-
- - Full vulnerability audit with configurable severity levels
- - License compliance verification
- - Detailed vulnerability reporting
-
- **Static Code Analysis:**
-
- - Extended CodeQL analysis with security-focused queries
- - ESLint security plugin integration
- - Pattern-based secret detection
-
- **Container Security:**
-
- - Trivy vulnerability scanning
- - SARIF report generation
- - Container base image analysis
-
- **Automated Issue Creation:**
-
- - GitHub issues created for critical vulnerabilities
- - Detailed security reports with remediation steps
- - Configurable severity thresholds
-
- ### Security Thresholds and Policies
-
- **Build Failure Conditions:**
-
- - Any critical severity vulnerabilities
- - More than 3 high severity vulnerabilities
- - Security test suite failures
-
- **Automated Actions:**
-
- - Daily security scans at 2 AM UTC
- - PR blocking for security failures
- - Automatic security issue creation
- - Comprehensive artifact collection
-
- ### Running Security Tests Locally
-
- ```bash
- # Run the complete security test suite
- npm run test:security
-
- # Check for dependency vulnerabilities
- npm audit --audit-level moderate
-
- # Fix automatically resolvable vulnerabilities
- npm audit fix
-
- # Generate security report manually
- mkdir security-results
- npm audit --json > security-results/audit.json
-
- # Run specific security validation
- node tests/security/security-test-suite.js
- ```
-
- ### Security Artifacts and Reports
-
- **Generated Reports:**
-
- - `SECURITY-REPORT.md`: Comprehensive security assessment
- - `npm-audit.json`: Detailed vulnerability data
- - `security-tests.log`: Test execution logs
- - `dependency-analysis.md`: Package security analysis
- - `license-check.md`: License compliance report
-
- **Artifact Retention:**
-
- - CI security results: 30 days
- - Comprehensive security reports: 90 days
- - Critical vulnerability reports: Indefinite
-
- ### Manual Security Scan Triggers
-
- The security workflow can be manually triggered with custom parameters:
-
- ```bash
- # Via GitHub CLI
- gh workflow run security.yml \
- --field scan_type=all \
- --field severity_threshold=moderate
-
- # Via GitHub UI
- # Go to Actions > Security Scanning > Run workflow
- ```
-
- **Available Options:**
-
- - `scan_type`: all, dependencies, code-analysis, container-scan
- - `severity_threshold`: low, moderate, high, critical
-
- ### Security Integration Best Practices
-
- **For Contributors:**
-
- 1. Always run `npm run test:security` before submitting PRs
- 2. Address any security warnings in your code
- 3. Keep dependencies updated with `npm audit fix`
- 4. Review security artifacts when CI fails
-
- **For Maintainers:**
-
- 1. Review security reports weekly
- 2. Respond to automated security issues promptly
- 3. Keep security thresholds updated
- 4. Monitor trending vulnerabilities in dependencies
-
- ### Security Documentation
-
- Comprehensive security documentation is available in:
-
- - `.github/SECURITY.md` - Complete security policy and procedures
- - Security workflow logs and artifacts
- - Generated security reports in CI runs
-
- The security integration ensures that:
-
- No critical vulnerabilities reach production
- Security issues are detected early in development
- Comprehensive audit trails are maintained
- Automated remediation guidance is provided
-
- ## Important Implementation Patterns
+ ## Implementation Patterns
 
  ### Tool Structure
 
- All tools follow a consistent class-based pattern:
-
  ```javascript
  export class ToolName {
- constructor(config) {
- this.config = config;
- // Initialize resources
- }
+ constructor(config) { this.config = config; }
 
  async execute(params) {
- // Validate params (Zod validation done in server.js)
- // Execute tool logic
- // Return structured result
  return { success: true, data: {...} };
  }
 
- async destroy() {
- // Cleanup resources (browsers, connections, etc.)
- }
+ async destroy() { /* cleanup resources */ }
  }
  ```
 
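Filling in the skeleton above, a minimal concrete tool could look like this (the tool and its fields are hypothetical, not one of the 19 shipped tools; Zod validation of `params` happens in server.js, not in the tool):

```javascript
// Hypothetical tool following the class pattern above.
export class WordCountTool {
  constructor(config = {}) {
    this.config = config;
  }

  async execute(params) {
    // Count whitespace-separated words in the supplied text
    const words = (params.text ?? '').trim().split(/\s+/).filter(Boolean);
    return { success: true, data: { wordCount: words.length } };
  }

  async destroy() {
    // Nothing to release here; browser-backed tools would close pages etc.
  }
}
```

An instance would then be registered in server.js via `server.registerTool()` wrapped with `withAuth()`, as described under Adding New Tools.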
- ### Search Provider Architecture
-
- All search requests are proxied through the CrawlForge.dev API:
-
- `crawlforgeSearch.js` - Proxies through CrawlForge.dev API (Google Search backend)
- No Google API credentials needed from users
- Users only need their CrawlForge API key
- Credit cost: 2 credits per search
-
- Factory in `src/tools/search/adapters/searchProviderFactory.js`
-
- ### Browser Management
- - Context isolation per operation for security
-
- ### Memory Management
-
- Critical for long-running processes:
-
- - Graceful shutdown handlers registered for SIGINT/SIGTERM
- - All tools with heavy resources must implement `destroy()` or `cleanup()`
- - Memory monitoring in development mode (server.js line 1955-1963)
- - Force GC on shutdown if available
-
  ### Error Handling Pattern
 
  ```javascript
@@ -453,19 +241,10 @@ try {
  }
  ```
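The body of the try block is elided in the hunk above; a sketch of its presumable shape (the structured error format here is an assumption, mirroring the `{ success, data }` result shown in the tool pattern):

```javascript
// Assumed shape of the error handling pattern: catch, then return a
// structured failure result instead of letting the exception escape.
async function runTool(tool, params) {
  try {
    const data = await tool.execute(params);
    return { success: true, data };
  } catch (error) {
    return { success: false, error: error.message };
  }
}
```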
 
- ### Configuration Validation
-
- - All config in `src/constants/config.js` with defaults
- - `validateConfig()` checks required settings
- - Environment variables parsed with fallbacks
- - Config errors only fail in production (warnings in dev)
-
- ## 🎯 Project Management Rules
-
- ## 🎯 Project Management Rules
+ ## Project Management Rules
 
- always have the project manager work with the appropriate sub agents in parallel
- i want the project manager to always be in charge and then get the appropriate sub agents to work on the tasks in parallel. each sub agent must work on their strengths. when they are done they let the project manager know and the project manager updates the @docs/PRODUCTION_READINESS.md file.
- whenever a phase is completed push all changes to github
- put all the documentation md files into the docs folders to keep everything organized
- every time you finish a phase run npm run build and fix all errors. do this before you push to github.
+ - Always have the project manager work with the appropriate sub agents in parallel
+ - Each sub agent must work on their strengths; when done they report to the project manager who updates `docs/PRODUCTION_READINESS.md`
+ - Whenever a phase is completed, push all changes to GitHub
+ - Put all documentation md files into the `docs/` folder
+ - Every time you finish a phase run `npm run build` and fix all errors before pushing
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "crawlforge-mcp-server",
- "version": "3.0.11",
+ "version": "3.0.13",
  "description": "CrawlForge MCP Server - Professional Model Context Protocol server with 19 comprehensive web scraping, crawling, and content processing tools.",
  "main": "server.js",
  "bin": {
@@ -9,6 +9,7 @@
  },
  "scripts": {
  "start": "node server.js",
+ "start:http": "node server.js --http",
  "setup": "node setup.js",
  "dev": "cross-env NODE_ENV=development node server.js",
  "test": "node tests/integration/mcp-protocol-compliance.test.js",