crawlforge-mcp-server 4.7.2 → 4.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (50)
  1. package/CLAUDE.md +2 -2
  2. package/package.json +2 -1
  3. package/server.js +42 -9
  4. package/src/cli/commands/init.js +13 -2
  5. package/src/cli/commands/install-skills.js +10 -1
  6. package/src/cli/commands/monitor.js +81 -0
  7. package/src/cli/commands/uninstall-skills.js +10 -1
  8. package/src/core/ActionExecutor.js +51 -9
  9. package/src/core/ElicitationHelper.js +18 -5
  10. package/src/core/LLMsTxtAnalyzer.js +2 -1
  11. package/src/core/MonitorScheduler.js +281 -0
  12. package/src/core/MonitorStore.js +79 -0
  13. package/src/core/ResearchOrchestrator.js +2 -1
  14. package/src/core/crawlers/BFSCrawler.js +2 -1
  15. package/src/skills/agent-skills/crawlforge-batch-automation/SKILL.md +126 -0
  16. package/src/skills/agent-skills/crawlforge-batch-automation/references/actions.md +127 -0
  17. package/src/skills/agent-skills/crawlforge-change-tracking/SKILL.md +116 -0
  18. package/src/skills/agent-skills/crawlforge-deep-research/SKILL.md +108 -0
  19. package/src/skills/agent-skills/crawlforge-deep-research/references/workflows.md +76 -0
  20. package/src/skills/agent-skills/crawlforge-getting-started/SKILL.md +89 -0
  21. package/src/skills/agent-skills/crawlforge-getting-started/references/cli.md +71 -0
  22. package/src/skills/agent-skills/crawlforge-getting-started/references/credits.md +75 -0
  23. package/src/skills/agent-skills/crawlforge-stealth-browsing/SKILL.md +106 -0
  24. package/src/skills/agent-skills/crawlforge-stealth-browsing/references/engine-selection.md +63 -0
  25. package/src/skills/agent-skills/crawlforge-structured-extraction/SKILL.md +121 -0
  26. package/src/skills/agent-skills/crawlforge-structured-extraction/references/templates.md +39 -0
  27. package/src/skills/agent-skills/crawlforge-web-scraping/SKILL.md +141 -0
  28. package/src/skills/agent-skills/crawlforge-web-scraping/references/tool-reference.md +95 -0
  29. package/src/skills/installer.js +186 -34
  30. package/src/tools/advanced/batchScrape/worker.js +8 -2
  31. package/src/tools/basic/_fetch.js +14 -1
  32. package/src/tools/crawl/_sessionContext.js +3 -1
  33. package/src/tools/extract/_fetchAndParse.js +2 -1
  34. package/src/tools/extract/extractContent.js +2 -1
  35. package/src/tools/extract/processDocument.js +2 -1
  36. package/src/tools/scrape/_brandingExtractor.js +378 -0
  37. package/src/tools/scrape/unifiedScrape.js +66 -6
  38. package/src/tools/templates/ScrapeTemplateTool.js +2 -1
  39. package/src/tools/tracking/trackChanges/differ.js +3 -1
  40. package/src/tools/tracking/trackChanges/index.js +74 -21
  41. package/src/tools/tracking/trackChanges/schema.js +7 -2
  42. package/src/utils/hostRateLimiter.js +46 -0
  43. package/src/utils/robotsChecker.js +2 -1
  44. package/src/utils/sitemapParser.js +2 -1
  45. package/src/utils/ssrfGuard.js +161 -0
  46. package/src/utils/ssrfProtection.js +6 -9
  47. package/src/skills/crawlforge-cli.md +0 -157
  48. package/src/skills/crawlforge-mcp.md +0 -80
  49. package/src/skills/crawlforge-research.md +0 -104
  50. package/src/skills/crawlforge-stealth.md +0 -98
package/src/skills/crawlforge-stealth.md +0 -98
@@ -1,98 +0,0 @@
- # CrawlForge Stealth Mode Guide
-
- ## When to Use stealth_mode
-
- Use `stealth_mode` when a site returns bot-detection errors, 403 responses, CAPTCHAs, or JavaScript-rendered content that `fetch_url` and `extract_content` cannot access.
-
- Signs you need stealth mode:
- - Site returns 403 or 429 on regular fetch
- - Content is empty or shows "please enable JavaScript"
- - Site uses Cloudflare, DataDome, PerimeterX, or similar bot protection
-
- ## Engines
-
- ### playwright (default)
- - Chromium-based with stealth patches
- - Masks webdriver fingerprints, User-Agent, navigator properties
- - Good for most sites with basic bot detection
- - Lower resource usage
-
- ### camoufox
- - Firefox-based with native anti-detection
- - No patches applied — uses Firefox's native properties
- - Scores higher on CreepJS and DataDome than patched Chromium
- - Use for sites with advanced fingerprinting (financial, e-commerce)
-
- ## MCP Tool Usage
-
- ```json
- // Basic stealth scrape
- {
-   "tool": "stealth_mode",
-   "params": {
-     "url": "https://protected-site.com",
-     "engine": "playwright"
-   }
- }
-
- // Advanced: Camoufox engine with screenshot
- {
-   "tool": "stealth_mode",
-   "params": {
-     "url": "https://heavily-protected-site.com",
-     "engine": "camoufox",
-     "wait_for": 3000,
-     "screenshot": true
-   }
- }
- ```
-
- ## CLI Usage
-
- ```bash
- # Default engine (playwright)
- crawlforge stealth https://protected-site.com
-
- # Camoufox for advanced bot detection bypass
- crawlforge stealth https://protected-site.com --engine camoufox
-
- # Wait for JS-heavy page to render, capture screenshot
- crawlforge stealth https://spa-site.com --wait 3000 --screenshot
-
- # Output as JSON
- crawlforge stealth https://example.com --json
- ```
-
- ## Engine Selection Guide
-
- | Scenario | Recommended Engine |
- |----------|-------------------|
- | General JS-rendered sites | playwright |
- | Cloudflare-protected sites | camoufox |
- | Sites with DataDome | camoufox |
- | Sites with PerimeterX | camoufox |
- | Financial/trading sites | camoufox |
- | Speed-critical scraping | playwright |
- | Basic bot detection bypass | playwright |
-
- ## Environment Variable
-
- Force engine globally:
- ```bash
- export CRAWLFORGE_STEALTH_ENGINE=camoufox
- ```
-
- ## Combining with Other Tools
-
- After extracting raw HTML via stealth_mode, pipe to analyze_content or extract_structured:
- ```json
- // Step 1: get HTML via stealth
- { "tool": "stealth_mode", "params": { "url": "https://example.com" } }
-
- // Step 2: extract structured data from the result
- { "tool": "extract_structured", "params": { "url": "https://example.com", "schema": {...} } }
- ```
-
- ## Credits
- - `stealth_mode`: 5 credits per call
- - Additional costs for screenshots (1 extra credit per screenshot)
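
Note (not part of the diff): the engine selection table and credit rules in the removed guide can be read as a small lookup. The TypeScript below is a minimal sketch under that reading. The names `Scenario`, `ENGINE_BY_SCENARIO`, `buildStealthRequest`, and `estimateCredits` are hypothetical and do not come from the package. Only the `stealth_mode` tool name, its `url`, `engine`, `wait_for`, and `screenshot` parameters, and the 5 + 1 credit figures are taken from the removed file, and the credit estimate assumes one screenshot per call.

```typescript
// Hypothetical helpers; only the tool name, parameter names, and credit
// figures are taken from the removed guide.
type Engine = "playwright" | "camoufox";

type Scenario =
  | "js-rendered"
  | "cloudflare"
  | "datadome"
  | "perimeterx"
  | "financial"
  | "speed-critical"
  | "basic-bot-detection";

// The guide's "Engine Selection Guide" table expressed as a lookup.
const ENGINE_BY_SCENARIO: Record<Scenario, Engine> = {
  "js-rendered": "playwright",
  cloudflare: "camoufox",
  datadome: "camoufox",
  perimeterx: "camoufox",
  financial: "camoufox",
  "speed-critical": "playwright",
  "basic-bot-detection": "playwright",
};

interface StealthRequest {
  tool: "stealth_mode";
  params: {
    url: string;
    engine: Engine;
    wait_for?: number;
    screenshot?: boolean;
  };
}

// Build a request payload shaped like the guide's MCP examples.
function buildStealthRequest(
  url: string,
  scenario: Scenario,
  opts: { waitFor?: number; screenshot?: boolean } = {},
): StealthRequest {
  const params: StealthRequest["params"] = {
    url,
    engine: ENGINE_BY_SCENARIO[scenario],
  };
  if (opts.waitFor !== undefined) params.wait_for = opts.waitFor;
  if (opts.screenshot) params.screenshot = true;
  return { tool: "stealth_mode", params };
}

// 5 credits per call, plus 1 if a screenshot is requested
// (assumes a single screenshot per call).
function estimateCredits(req: StealthRequest): number {
  return 5 + (req.params.screenshot ? 1 : 0);
}
```

For example, `buildStealthRequest("https://heavily-protected-site.com", "datadome", { waitFor: 3000, screenshot: true })` produces the same payload as the guide's "Advanced" example, and `estimateCredits` returns 6 for it.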