crawlforge-mcp-server 4.7.1 → 4.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +2 -2
- package/package.json +2 -1
- package/server.js +56 -10
- package/src/cli/commands/init.js +13 -2
- package/src/cli/commands/install-skills.js +10 -1
- package/src/cli/commands/monitor.js +81 -0
- package/src/cli/commands/uninstall-skills.js +10 -1
- package/src/core/ActionExecutor.js +81 -15
- package/src/core/ElicitationHelper.js +18 -5
- package/src/core/LLMsTxtAnalyzer.js +2 -1
- package/src/core/MonitorScheduler.js +281 -0
- package/src/core/MonitorStore.js +79 -0
- package/src/core/ResearchOrchestrator.js +2 -1
- package/src/core/crawlers/BFSCrawler.js +2 -1
- package/src/resources/ResourceRegistry.js +3 -0
- package/src/skills/agent-skills/crawlforge-batch-automation/SKILL.md +126 -0
- package/src/skills/agent-skills/crawlforge-batch-automation/references/actions.md +127 -0
- package/src/skills/agent-skills/crawlforge-change-tracking/SKILL.md +116 -0
- package/src/skills/agent-skills/crawlforge-deep-research/SKILL.md +108 -0
- package/src/skills/agent-skills/crawlforge-deep-research/references/workflows.md +76 -0
- package/src/skills/agent-skills/crawlforge-getting-started/SKILL.md +89 -0
- package/src/skills/agent-skills/crawlforge-getting-started/references/cli.md +71 -0
- package/src/skills/agent-skills/crawlforge-getting-started/references/credits.md +75 -0
- package/src/skills/agent-skills/crawlforge-stealth-browsing/SKILL.md +106 -0
- package/src/skills/agent-skills/crawlforge-stealth-browsing/references/engine-selection.md +63 -0
- package/src/skills/agent-skills/crawlforge-structured-extraction/SKILL.md +121 -0
- package/src/skills/agent-skills/crawlforge-structured-extraction/references/templates.md +39 -0
- package/src/skills/agent-skills/crawlforge-web-scraping/SKILL.md +141 -0
- package/src/skills/agent-skills/crawlforge-web-scraping/references/tool-reference.md +95 -0
- package/src/skills/installer.js +186 -34
- package/src/tools/advanced/ScrapeWithActionsTool.js +7 -0
- package/src/tools/advanced/batchScrape/worker.js +8 -2
- package/src/tools/basic/_fetch.js +14 -1
- package/src/tools/crawl/_sessionContext.js +3 -1
- package/src/tools/extract/_fetchAndParse.js +2 -1
- package/src/tools/extract/extractContent.js +2 -1
- package/src/tools/extract/extractStructured.js +43 -0
- package/src/tools/extract/processDocument.js +2 -1
- package/src/tools/scrape/_brandingExtractor.js +378 -0
- package/src/tools/scrape/unifiedScrape.js +66 -6
- package/src/tools/templates/ScrapeTemplateTool.js +2 -1
- package/src/tools/tracking/trackChanges/differ.js +3 -1
- package/src/tools/tracking/trackChanges/index.js +74 -21
- package/src/tools/tracking/trackChanges/schema.js +7 -2
- package/src/utils/hostRateLimiter.js +46 -0
- package/src/utils/robotsChecker.js +2 -1
- package/src/utils/sitemapParser.js +2 -1
- package/src/utils/ssrfGuard.js +161 -0
- package/src/utils/ssrfProtection.js +6 -9
- package/src/skills/crawlforge-cli.md +0 -157
- package/src/skills/crawlforge-mcp.md +0 -80
- package/src/skills/crawlforge-research.md +0 -104
- package/src/skills/crawlforge-stealth.md +0 -98
@@ -1,98 +0,0 @@
-# CrawlForge Stealth Mode Guide
-
-## When to Use stealth_mode
-
-Use `stealth_mode` when a site returns bot-detection errors, 403 responses, CAPTCHAs, or JavaScript-rendered content that `fetch_url` and `extract_content` cannot access.
-
-Signs you need stealth mode:
-- Site returns 403 or 429 on regular fetch
-- Content is empty or shows "please enable JavaScript"
-- Site uses Cloudflare, DataDome, PerimeterX, or similar bot protection
-
-## Engines
-
-### playwright (default)
-- Chromium-based with stealth patches
-- Masks webdriver fingerprints, User-Agent, navigator properties
-- Good for most sites with basic bot detection
-- Lower resource usage
-
-### camoufox
-- Firefox-based with native anti-detection
-- No patches applied — uses Firefox's native properties
-- Scores higher on CreepJS and DataDome than patched Chromium
-- Use for sites with advanced fingerprinting (financial, e-commerce)
-
-## MCP Tool Usage
-
-```json
-// Basic stealth scrape
-{
-  "tool": "stealth_mode",
-  "params": {
-    "url": "https://protected-site.com",
-    "engine": "playwright"
-  }
-}
-
-// Advanced: Camoufox engine with screenshot
-{
-  "tool": "stealth_mode",
-  "params": {
-    "url": "https://heavily-protected-site.com",
-    "engine": "camoufox",
-    "wait_for": 3000,
-    "screenshot": true
-  }
-}
-```
-
-## CLI Usage
-
-```bash
-# Default engine (playwright)
-crawlforge stealth https://protected-site.com
-
-# Camoufox for advanced bot detection bypass
-crawlforge stealth https://protected-site.com --engine camoufox
-
-# Wait for JS-heavy page to render, capture screenshot
-crawlforge stealth https://spa-site.com --wait 3000 --screenshot
-
-# Output as JSON
-crawlforge stealth https://example.com --json
-```
-
-## Engine Selection Guide
-
-| Scenario | Recommended Engine |
-|----------|-------------------|
-| General JS-rendered sites | playwright |
-| Cloudflare-protected sites | camoufox |
-| Sites with DataDome | camoufox |
-| Sites with PerimeterX | camoufox |
-| Financial/trading sites | camoufox |
-| Speed-critical scraping | playwright |
-| Basic bot detection bypass | playwright |
-
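The removed Engine Selection Guide table amounts to a lookup from scenario to engine. A minimal JavaScript sketch of that lookup follows. The names `ENGINE_BY_SCENARIO`, `recommendEngine`, and `buildStealthCall`, as well as the scenario keys, are hypothetical labels chosen for illustration and do not come from the package. Only the engine names and the `{ tool, params }` call shape are taken from the removed guide.

```javascript
// Hypothetical lookup mirroring the removed "Engine Selection Guide" table.
// The scenario keys are illustrative labels, not identifiers from the package.
const ENGINE_BY_SCENARIO = new Map([
  ['general-js', 'playwright'],
  ['cloudflare', 'camoufox'],
  ['datadome', 'camoufox'],
  ['perimeterx', 'camoufox'],
  ['financial', 'camoufox'],
  ['speed-critical', 'playwright'],
  ['basic-bot-detection', 'playwright'],
]);

// Unknown scenarios fall back to "playwright", which the removed guide
// lists as the default engine.
function recommendEngine(scenario) {
  return ENGINE_BY_SCENARIO.get(scenario) ?? 'playwright';
}

// Builds a call object in the { tool, params } shape shown in the removed guide.
function buildStealthCall(url, scenario) {
  return {
    tool: 'stealth_mode',
    params: { url, engine: recommendEngine(scenario) },
  };
}

console.log(JSON.stringify(buildStealthCall('https://protected-site.com', 'datadome')));
```

Keeping the mapping in data rather than in branching code makes it easy to compare against whatever guidance the replacement skill files in 4.8.0 give.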
-## Environment Variable
-
-Force engine globally:
-```bash
-export CRAWLFORGE_STEALTH_ENGINE=camoufox
-```
-
-## Combining with Other Tools
-
-After extracting raw HTML via stealth_mode, pipe to analyze_content or extract_structured:
-```json
-// Step 1: get HTML via stealth
-{ "tool": "stealth_mode", "params": { "url": "https://example.com" } }
-
-// Step 2: extract structured data from the result
-{ "tool": "extract_structured", "params": { "url": "https://example.com", "schema": {...} } }
-```
-
-## Credits
-- `stealth_mode`: 5 credits per call
-- Additional costs for screenshots (1 extra credit per screenshot)
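The credit figures in the removed guide (5 credits per `stealth_mode` call, plus 1 per screenshot) can be turned into a quick budget estimate. The sketch below is hypothetical: `estimateStealthCredits` is not a package function, and it assumes a call with `screenshot: true` produces exactly one screenshot. The figures come from the file removed in this diff and may not match current pricing.

```javascript
// Figures quoted from the removed guide; they may differ in later versions.
const STEALTH_CALL_CREDITS = 5;
const SCREENSHOT_CREDITS = 1;

// Hypothetical helper: sums the cost of a list of planned stealth_mode calls.
// Assumes `screenshot: true` means exactly one screenshot for that call.
function estimateStealthCredits(calls) {
  return calls.reduce(
    (total, call) =>
      total + STEALTH_CALL_CREDITS + (call.screenshot ? SCREENSHOT_CREDITS : 0),
    0
  );
}

const plan = [
  { url: 'https://protected-site.com', engine: 'playwright' },
  { url: 'https://heavily-protected-site.com', engine: 'camoufox', screenshot: true },
];

console.log(estimateStealthCredits(plan)); // 5 + (5 + 1) = 11
```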