crawlforge-mcp-server 4.7.1 → 4.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +2 -2
- package/package.json +2 -1
- package/server.js +56 -10
- package/src/cli/commands/init.js +13 -2
- package/src/cli/commands/install-skills.js +10 -1
- package/src/cli/commands/monitor.js +81 -0
- package/src/cli/commands/uninstall-skills.js +10 -1
- package/src/core/ActionExecutor.js +81 -15
- package/src/core/ElicitationHelper.js +18 -5
- package/src/core/LLMsTxtAnalyzer.js +2 -1
- package/src/core/MonitorScheduler.js +281 -0
- package/src/core/MonitorStore.js +79 -0
- package/src/core/ResearchOrchestrator.js +2 -1
- package/src/core/crawlers/BFSCrawler.js +2 -1
- package/src/resources/ResourceRegistry.js +3 -0
- package/src/skills/agent-skills/crawlforge-batch-automation/SKILL.md +126 -0
- package/src/skills/agent-skills/crawlforge-batch-automation/references/actions.md +127 -0
- package/src/skills/agent-skills/crawlforge-change-tracking/SKILL.md +116 -0
- package/src/skills/agent-skills/crawlforge-deep-research/SKILL.md +108 -0
- package/src/skills/agent-skills/crawlforge-deep-research/references/workflows.md +76 -0
- package/src/skills/agent-skills/crawlforge-getting-started/SKILL.md +89 -0
- package/src/skills/agent-skills/crawlforge-getting-started/references/cli.md +71 -0
- package/src/skills/agent-skills/crawlforge-getting-started/references/credits.md +75 -0
- package/src/skills/agent-skills/crawlforge-stealth-browsing/SKILL.md +106 -0
- package/src/skills/agent-skills/crawlforge-stealth-browsing/references/engine-selection.md +63 -0
- package/src/skills/agent-skills/crawlforge-structured-extraction/SKILL.md +121 -0
- package/src/skills/agent-skills/crawlforge-structured-extraction/references/templates.md +39 -0
- package/src/skills/agent-skills/crawlforge-web-scraping/SKILL.md +141 -0
- package/src/skills/agent-skills/crawlforge-web-scraping/references/tool-reference.md +95 -0
- package/src/skills/installer.js +186 -34
- package/src/tools/advanced/ScrapeWithActionsTool.js +7 -0
- package/src/tools/advanced/batchScrape/worker.js +8 -2
- package/src/tools/basic/_fetch.js +14 -1
- package/src/tools/crawl/_sessionContext.js +3 -1
- package/src/tools/extract/_fetchAndParse.js +2 -1
- package/src/tools/extract/extractContent.js +2 -1
- package/src/tools/extract/extractStructured.js +43 -0
- package/src/tools/extract/processDocument.js +2 -1
- package/src/tools/scrape/_brandingExtractor.js +378 -0
- package/src/tools/scrape/unifiedScrape.js +66 -6
- package/src/tools/templates/ScrapeTemplateTool.js +2 -1
- package/src/tools/tracking/trackChanges/differ.js +3 -1
- package/src/tools/tracking/trackChanges/index.js +74 -21
- package/src/tools/tracking/trackChanges/schema.js +7 -2
- package/src/utils/hostRateLimiter.js +46 -0
- package/src/utils/robotsChecker.js +2 -1
- package/src/utils/sitemapParser.js +2 -1
- package/src/utils/ssrfGuard.js +161 -0
- package/src/utils/ssrfProtection.js +6 -9
- package/src/skills/crawlforge-cli.md +0 -157
- package/src/skills/crawlforge-mcp.md +0 -80
- package/src/skills/crawlforge-research.md +0 -104
- package/src/skills/crawlforge-stealth.md +0 -98
|
@@ -0,0 +1,116 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: crawlforge-change-tracking
|
|
3
|
+
description: "Monitors web pages for changes over time with CrawlForge's track_changes tool. Use when the user wants to track changes to a page, watch a URL, monitor competitor pricing, detect when content updates, get notified of regulation or product-availability changes, or diff a page against a saved baseline. Workflow: create a baseline with operation create_baseline, then periodically compare with operation compare to get a change percentage and a diff; supports CSS-selector scoping, webhooks, and scheduled monitoring."
|
|
4
|
+
metadata:
|
|
5
|
+
version: 4.8.0
|
|
6
|
+
source: crawlforge-mcp-server
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# CrawlForge Change Tracking
|
|
10
|
+
|
|
11
|
+
Detect when a web page changes over time using the `track_changes` tool. Useful
|
|
12
|
+
for competitor pricing, regulation updates, product availability, and any page
|
|
13
|
+
you need to watch for edits.
|
|
14
|
+
|
|
15
|
+
## When to use
|
|
16
|
+
|
|
17
|
+
- "Track changes to this page" / "watch this URL"
|
|
18
|
+
- "Tell me when competitor pricing changes"
|
|
19
|
+
- "Detect when this content updates"
|
|
20
|
+
- "Diff this page against last week's version"
|
|
21
|
+
- "Notify me when product availability / a regulation changes"
|
|
22
|
+
|
|
23
|
+
## Core workflow
|
|
24
|
+
|
|
25
|
+
`track_changes` is one tool driven by an `operation` parameter (cost: 3 credits
|
|
26
|
+
per call).
|
|
27
|
+
|
|
28
|
+
1. **Create a baseline** the first time:
|
|
29
|
+
|
|
30
|
+
```json
|
|
31
|
+
{ "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "create_baseline" } }
|
|
32
|
+
```
|
|
33
|
+
|
|
34
|
+
2. **Compare** later to get the change percentage + diff:
|
|
35
|
+
|
|
36
|
+
```json
|
|
37
|
+
{ "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "compare" } }
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
`compare` is the default operation. It returns a change percentage and a
|
|
41
|
+
structured diff against the stored baseline.
|
|
42
|
+
|
|
43
|
+
## Scoping to part of a page
|
|
44
|
+
|
|
45
|
+
Use `trackingOptions` to ignore noise and focus on what matters:
|
|
46
|
+
|
|
47
|
+
```json
|
|
48
|
+
{
|
|
49
|
+
"tool": "track_changes",
|
|
50
|
+
"params": {
|
|
51
|
+
"url": "https://example.com/pricing",
|
|
52
|
+
"operation": "compare",
|
|
53
|
+
"trackingOptions": {
|
|
54
|
+
"granularity": "element",
|
|
55
|
+
"customSelectors": [".price", ".plan-name"],
|
|
56
|
+
"excludeSelectors": [".timestamp", ".ad"],
|
|
57
|
+
"ignoreWhitespace": true
|
|
58
|
+
}
|
|
59
|
+
}
|
|
60
|
+
}
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
`granularity`: `page`, `section` (default), `element`, `text`. Toggle
|
|
64
|
+
`trackText`, `trackStructure`, `trackLinks`, `trackImages`. Set
|
|
65
|
+
`significanceThresholds` (`minor`/`moderate`/`major`) to classify change size.
|
|
66
|
+
|
|
67
|
+
CLI: `crawlforge track https://example.com --selector ".price" --threshold 1`.
|
|
68
|
+
|
|
69
|
+
## Scheduled monitoring & notifications
|
|
70
|
+
|
|
71
|
+
Run continuous monitoring with webhooks instead of polling manually:
|
|
72
|
+
|
|
73
|
+
```json
|
|
74
|
+
{
|
|
75
|
+
"tool": "track_changes",
|
|
76
|
+
"params": {
|
|
77
|
+
"url": "https://example.com/pricing",
|
|
78
|
+
"operation": "create_scheduled_monitor",
|
|
79
|
+
"monitoringOptions": {
|
|
80
|
+
"enabled": true,
|
|
81
|
+
"interval": 300000,
|
|
82
|
+
"notificationThreshold": "moderate",
|
|
83
|
+
"enableWebhook": true,
|
|
84
|
+
"webhookUrl": "https://my-site.com/notify"
|
|
85
|
+
}
|
|
86
|
+
}
|
|
87
|
+
}
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
CLI (runs until Ctrl+C):
|
|
91
|
+
`crawlforge monitor https://example.com --interval 60 --webhook https://my-site.com/notify`.
|
|
92
|
+
|
|
93
|
+
## Other operations
|
|
94
|
+
|
|
95
|
+
| Operation | Purpose |
|
|
96
|
+
|-----------|---------|
|
|
97
|
+
| `create_baseline` | Save the first snapshot to diff against. |
|
|
98
|
+
| `compare` (default) | Diff current content vs baseline → % change + diff. |
|
|
99
|
+
| `monitor` | One monitoring pass. |
|
|
100
|
+
| `get_history` | Retrieve past change records (`queryOptions`). |
|
|
101
|
+
| `get_stats` | Summary statistics for a tracked URL. |
|
|
102
|
+
| `create_scheduled_monitor` / `stop_scheduled_monitor` | Manage cron-style monitors. |
|
|
103
|
+
| `get_dashboard` | Aggregate status, recent alerts, trends. |
|
|
104
|
+
| `export_history` | Export change history as `json` or `csv`. |
|
|
105
|
+
| `create_alert_rule` | Conditional alerts (webhook / email / slack). |
|
|
106
|
+
| `generate_trend_report` | Trend analysis over time. |
|
|
107
|
+
| `get_monitoring_templates` | List built-in monitoring presets. |
|
|
108
|
+
|
|
109
|
+
You can also pass `content` or `html` directly to compare pre-fetched content
|
|
110
|
+
against the baseline without re-fetching.
|
|
111
|
+
|
|
112
|
+
## Cost note
|
|
113
|
+
|
|
114
|
+
`track_changes` = 3 credits per call. A typical watch is one `create_baseline`
|
|
115
|
+
plus periodic `compare` calls; scheduled monitors run server-side and notify via
|
|
116
|
+
webhook/Slack so you don't pay for manual polling loops.
|
|
@@ -0,0 +1,108 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: crawlforge-deep-research
|
|
3
|
+
description: "Runs multi-source web research and autonomous question-answering with CrawlForge's deep_research, agent, and search_web tools. Use when the user wants to research a topic, do a deep dive, compare competitors, gather facts with citations, answer a question from the web, search the web, or get a synthesized report from many sources. deep_research expands queries, fetches and dedupes sources, then synthesizes; agent autonomously plans searches and navigation from a plain-English prompt with no URLs needed; search_web returns ranked results. Caps costs via max_urls and confirms before expensive runs."
|
|
4
|
+
metadata:
|
|
5
|
+
version: 4.8.0
|
|
6
|
+
source: crawlforge-mcp-server
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# CrawlForge Deep Research
|
|
10
|
+
|
|
11
|
+
Answer questions and produce reports from many web sources using the CrawlForge
|
|
12
|
+
MCP server. Three tools, from lightest to heaviest: `search_web` (ranked
|
|
13
|
+
results), `agent` (autonomous plan-and-answer), `deep_research` (exhaustive
|
|
14
|
+
multi-source synthesis).
|
|
15
|
+
|
|
16
|
+
## When to use
|
|
17
|
+
|
|
18
|
+
- "Search the web for X" / "find pages about X" → `search_web`
|
|
19
|
+
- "Answer this question from the web" / "what are the top 5 X" (no URLs given) → `agent`
|
|
20
|
+
- "Research this topic in depth" / "compare competitors" / "give me a cited
|
|
21
|
+
report" / "gather facts with sources" → `deep_research`
|
|
22
|
+
|
|
23
|
+
Do NOT use these for a single known URL — use the **crawlforge-web-scraping**
|
|
24
|
+
skill (`scrape` / `extract_content`) instead.
|
|
25
|
+
|
|
26
|
+
## Tool selection
|
|
27
|
+
|
|
28
|
+
| Need | Tool | Cost |
|
|
29
|
+
|------|------|------|
|
|
30
|
+
| A ranked list of result URLs + snippets | `search_web` | 5 |
|
|
31
|
+
| A direct answer, agent decides what to read | `agent` | 8 (scales) |
|
|
32
|
+
| A synthesized, multi-source, cited report | `deep_research` | 10+ (scales) |
|
|
33
|
+
|
|
34
|
+
A common pipeline: `search_web` to find sources → `batch_scrape` (see
|
|
35
|
+
**crawlforge-batch-automation**) to read them → `deep_research` to synthesize.
|
|
36
|
+
|
|
37
|
+
## search_web (cost: 5)
|
|
38
|
+
|
|
39
|
+
```json
|
|
40
|
+
{
|
|
41
|
+
"tool": "search_web",
|
|
42
|
+
"params": { "query": "best MCP servers 2025", "limit": 10, "time_range": "month" }
|
|
43
|
+
}
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
Returns titles, URLs, snippets. Supports `lang`, `site` (domain filter),
|
|
47
|
+
`file_type`, `time_range` (`day`/`week`/`month`/`year`/`all`), `expand_query`,
|
|
48
|
+
`enable_ranking`, and `enable_deduplication`. CLI:
|
|
49
|
+
`crawlforge search "MCP server tutorial" --limit 5`.
|
|
50
|
+
|
|
51
|
+
## agent — autonomous answer (cost: 8, scales with maxUrls)
|
|
52
|
+
|
|
53
|
+
No URLs required. The agent plans search queries, fetches and filters relevant
|
|
54
|
+
pages, then returns a prose or structured answer.
|
|
55
|
+
|
|
56
|
+
```json
|
|
57
|
+
{
|
|
58
|
+
"tool": "agent",
|
|
59
|
+
"params": { "prompt": "What are the top 5 MCP servers in 2025 and who maintains them?", "maxUrls": 10 }
|
|
60
|
+
}
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
- `model:"default"` runs a SamplingClient loop (no LLM keys needed).
|
|
64
|
+
- `model:"pro"` runs the full ResearchOrchestrator (deeper, multi-source); it
|
|
65
|
+
asks for confirmation (elicitation) before running.
|
|
66
|
+
- Pass `schema` for structured JSON output, `urls` for optional seed URLs.
|
|
67
|
+
- Hard limits enforced by the orchestrator: `maxSteps` ≤ 10, `maxUrls` ≤ 20,
|
|
68
|
+
120s wall-clock. Output is degraded-but-useful if no LLM keys/Ollama exist.
|
|
69
|
+
|
|
70
|
+
## deep_research — exhaustive report (cost: 10+, scales)
|
|
71
|
+
|
|
72
|
+
```json
|
|
73
|
+
{
|
|
74
|
+
"tool": "deep_research",
|
|
75
|
+
"params": {
|
|
76
|
+
"topic": "React vs Vue vs Angular in 2025",
|
|
77
|
+
"maxUrls": 20,
|
|
78
|
+
"researchApproach": "comparative",
|
|
79
|
+
"outputFormat": "comprehensive"
|
|
80
|
+
}
|
|
81
|
+
}
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
How it works: query expansion → parallel fetch (deduped) → content extraction →
|
|
85
|
+
conflict detection → synthesis. `researchApproach`: `broad`, `focused`,
|
|
86
|
+
`academic`, `current_events`, `comparative`. `outputFormat`: `comprehensive`,
|
|
87
|
+
`summary`, `citations_only`, `conflicts_focus`. Results are stored as a
|
|
88
|
+
`crawlforge://research/{sessionId}` resource — re-read them without re-running.
|
|
89
|
+
|
|
90
|
+
CLI: `crawlforge research "B2B SaaS pricing trends" --depth deep --max-urls 30`.
|
|
91
|
+
|
|
92
|
+
### Cost control
|
|
93
|
+
|
|
94
|
+
`deep_research` starts at 10 credits and grows with sources. **`maxUrls > 50`
|
|
95
|
+
triggers a confirmation prompt (elicitation).** Cap cost with `maxUrls`
|
|
96
|
+
(e.g. `maxUrls:10` ≈ 12 credits). `timeLimit` (default 120000 ms) bounds
|
|
97
|
+
runtime.
|
|
98
|
+
|
|
99
|
+
## Synthesis & LLM fallback
|
|
100
|
+
|
|
101
|
+
The server ships with **no LLM key by design**. The fallback chain is:
|
|
102
|
+
Ollama (local) → OpenAI key → Anthropic key → MCP Sampling → raw evidence.
|
|
103
|
+
With no LLM configured, `deep_research` returns structured **raw evidence** for
|
|
104
|
+
the calling assistant (e.g. Claude Code) to synthesize — this is expected, do
|
|
105
|
+
not suggest adding API keys.
|
|
106
|
+
|
|
107
|
+
See [research workflows](references/workflows.md) for pipelines, depth tiers,
|
|
108
|
+
and parameter detail.
|
|
@@ -0,0 +1,76 @@
|
|
|
1
|
+
# Deep Research — Workflows & Parameters
|
|
2
|
+
|
|
3
|
+
## Pipelines
|
|
4
|
+
|
|
5
|
+
### Targeted competitive research
|
|
6
|
+
1. `search_web` `"competitor X pricing"` → candidate URLs.
|
|
7
|
+
2. `batch_scrape` the URLs → content in parallel (see crawlforge-batch-automation).
|
|
8
|
+
3. `deep_research` `"competitor X vs us"` → synthesized comparison.
|
|
9
|
+
|
|
10
|
+
### Quick fact-find
|
|
11
|
+
- `agent` with `model:"default"` and a tight `maxUrls` (5–8). Cheapest path to a
|
|
12
|
+
cited answer when you don't need a full report.
|
|
13
|
+
|
|
14
|
+
### Exhaustive report
|
|
15
|
+
- `deep_research` with `researchApproach` matched to intent and `maxUrls`
|
|
16
|
+
sized to budget.
|
|
17
|
+
|
|
18
|
+
## deep_research parameters (cost: 10+, scales with maxUrls)
|
|
19
|
+
|
|
20
|
+
| Param | Type | Default | Notes |
|
|
21
|
+
|-------|------|---------|-------|
|
|
22
|
+
| `topic` | string | — | Required, 3–500 chars. |
|
|
23
|
+
| `maxDepth` | number | `5` | 1–10. |
|
|
24
|
+
| `maxUrls` | number | `50` | 1–1000. **>50 triggers elicitation.** |
|
|
25
|
+
| `timeLimit` | number | `120000` | 30000–300000 ms. |
|
|
26
|
+
| `researchApproach` | enum | `broad` | `broad`, `focused`, `academic`, `current_events`, `comparative`. |
|
|
27
|
+
| `sourceTypes` | string[] | `["any"]` | `academic`, `news`, `government`, `commercial`, `blog`, `wiki`, `any`. |
|
|
28
|
+
| `credibilityThreshold` | number | `0.3` | 0–1 minimum source credibility. |
|
|
29
|
+
| `enableConflictDetection` | boolean | `true` | Flag contradictory claims. |
|
|
30
|
+
| `enableSourceVerification` | boolean | `true` | Score source credibility. |
|
|
31
|
+
| `enableSynthesis` | boolean | `true` | Build a coherent report. |
|
|
32
|
+
| `outputFormat` | enum | `comprehensive` | `comprehensive`, `summary`, `citations_only`, `conflicts_focus`. |
|
|
33
|
+
| `includeRawData` | boolean | `false` | Include raw scraped data. |
|
|
34
|
+
| `concurrency` | number | `5` | 1–20. |
|
|
35
|
+
| `webhook` | object | — | `started`/`progress`/`completed`/`failed` callbacks. |
|
|
36
|
+
|
|
37
|
+
Results stored at `crawlforge://research/{sessionId}` (list via `resources/list`).
|
|
38
|
+
|
|
39
|
+
## Approximate depth tiers (CLI `--depth`)
|
|
40
|
+
|
|
41
|
+
| Depth | URLs | Use case | Approx. credits |
|
|
42
|
+
|-------|------|----------|-----------------|
|
|
43
|
+
| basic | 5–10 | Quick overview | 10–15 |
|
|
44
|
+
| standard | 15–25 | General research | 15–30 |
|
|
45
|
+
| deep | 30–75 | Comprehensive analysis | 30–75+ |
|
|
46
|
+
|
|
47
|
+
## agent parameters (cost: 8, scales with maxUrls)
|
|
48
|
+
|
|
49
|
+
| Param | Type | Default | Notes |
|
|
50
|
+
|-------|------|---------|-------|
|
|
51
|
+
| `prompt` | string | — | Required, 1–2000 chars. |
|
|
52
|
+
| `urls` | string[] | — | Optional seed URLs, max 20. |
|
|
53
|
+
| `schema` | object | — | JSON schema for structured output. |
|
|
54
|
+
| `model` | enum | `default` | `default` = SamplingClient loop; `pro` = ResearchOrchestrator (confirms first). |
|
|
55
|
+
| `maxSteps` | number | `5` | Hard cap 10. |
|
|
56
|
+
| `maxUrls` | number | `10` | Hard cap 20. |
|
|
57
|
+
|
|
58
|
+
Orchestrator-enforced hard stops (never delegated to the LLM): maxSteps ≤ 10,
|
|
59
|
+
maxUrls ≤ 20, 120s wall-clock.
|
|
60
|
+
|
|
61
|
+
## search_web parameters (cost: 5)
|
|
62
|
+
|
|
63
|
+
| Param | Type | Notes |
|
|
64
|
+
|-------|------|-------|
|
|
65
|
+
| `query` | string | Required. |
|
|
66
|
+
| `limit` | number | 1–100. |
|
|
67
|
+
| `offset` | number | Pagination. |
|
|
68
|
+
| `lang` | string | e.g. `en`, `fr`. |
|
|
69
|
+
| `time_range` | enum | `day`, `week`, `month`, `year`, `all`. |
|
|
70
|
+
| `site` | string | Restrict to a domain. |
|
|
71
|
+
| `file_type` | string | e.g. `pdf`, `doc`. |
|
|
72
|
+
| `provider` | enum | `crawlforge` (default) or `searxng`. |
|
|
73
|
+
| `expand_query` | boolean | Synonyms/stemming/etc. |
|
|
74
|
+
| `enable_ranking` | boolean | BM25 + signal re-ranking. |
|
|
75
|
+
| `enable_deduplication` | boolean | Drop near-duplicates. |
|
|
76
|
+
| `localization` | object | `countryCode`, `language`, `timezone`, geo-targeting. |
|
|
@@ -0,0 +1,89 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: crawlforge-getting-started
|
|
3
|
+
description: "Orientation and tool-selection guide for the CrawlForge MCP server's 26 web tools. Use when the user is getting started with CrawlForge, asks which CrawlForge tool to use, how to set up the API key, how skills or the CLI work, what a tool costs in credits, or when one tool fails and a fallback is needed. Routes requests to the right specialized skill (web scraping, deep research, stealth, structured extraction, change tracking, batch automation), and explains MCP-tools-vs-CLI, the Ollama-first LLM fallback chain, and per-tool credit costs."
|
|
4
|
+
metadata:
|
|
5
|
+
version: 4.8.0
|
|
6
|
+
source: crawlforge-mcp-server
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# CrawlForge: Getting Started
|
|
10
|
+
|
|
11
|
+
CrawlForge is an MCP server with **26 tools** for web scraping, crawling,
|
|
12
|
+
extraction, research, change tracking, and AI-compliance. This skill orients you
|
|
13
|
+
and routes each request to the right specialized skill.
|
|
14
|
+
|
|
15
|
+
## Setup
|
|
16
|
+
|
|
17
|
+
1. Get an API key at https://crawlforge.dev/signup (1,000 free credits).
|
|
18
|
+
2. Provide it to the server:
|
|
19
|
+
|
|
20
|
+
```bash
|
|
21
|
+
export CRAWLFORGE_API_KEY="your_api_key_here"
|
|
22
|
+
# or, for the CLI, run: crawlforge init (detects key, installs skills, merges MCP config)
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
Every tool is metered and requires a key — there is no free tier. User config is
|
|
26
|
+
stored at `~/.crawlforge/config.json`.
|
|
27
|
+
|
|
28
|
+
## Route to the right skill
|
|
29
|
+
|
|
30
|
+
| The user wants to... | Use skill |
|
|
31
|
+
|----------------------|-----------|
|
|
32
|
+
| Scrape a page, get markdown/text/links/metadata, map or crawl a site | **crawlforge-web-scraping** |
|
|
33
|
+
| Research a topic, search the web, get a cited report, autonomous Q&A | **crawlforge-deep-research** |
|
|
34
|
+
| Get past 403/CAPTCHA/Cloudflare, or emulate a region/locale | **crawlforge-stealth-browsing** |
|
|
35
|
+
| Extract JSON/fields, parse a PDF, summarize, analyze sentiment | **crawlforge-structured-extraction** |
|
|
36
|
+
| Watch a page for changes / monitor pricing | **crawlforge-change-tracking** |
|
|
37
|
+
| Scrape many URLs, run browser actions, generate llms.txt | **crawlforge-batch-automation** |
|
|
38
|
+
|
|
39
|
+
## The 26 tools at a glance
|
|
40
|
+
|
|
41
|
+
- **Basic (5):** fetch_url, extract_text, extract_links, extract_metadata, scrape_structured
|
|
42
|
+
- **Unified (1):** scrape (multi-format single fetch)
|
|
43
|
+
- **Search & research (3):** search_web, deep_research, agent
|
|
44
|
+
- **Crawl (2):** crawl_deep, map_site
|
|
45
|
+
- **Extract & analyze (7):** extract_content, process_document, summarize_content, analyze_content, extract_structured, extract_with_llm, list_ollama_models
|
|
46
|
+
- **Batch & automation (4):** batch_scrape, get_batch_results, scrape_with_actions, generate_llms_txt
|
|
47
|
+
- **Stealth & locale (2):** stealth_mode, localization
|
|
48
|
+
- **Templates & tracking (2):** scrape_template, track_changes
|
|
49
|
+
|
|
50
|
+
## MCP tools vs CLI
|
|
51
|
+
|
|
52
|
+
- **MCP tools** — call inline within an AI assistant session (Claude Code,
|
|
53
|
+
Cursor, etc.). This is the default in chat.
|
|
54
|
+
- **CLI** (`crawlforge <command>`) — for scripts, CI, and pipelines. 15 tool
|
|
55
|
+
commands + 2 skill commands. See [cli](references/cli.md).
|
|
56
|
+
|
|
57
|
+
Both hit the same backend and consume the same credits.
|
|
58
|
+
|
|
59
|
+
## LLM fallback chain (Ollama-first)
|
|
60
|
+
|
|
61
|
+
LLM-backed tools (`extract_with_llm`, `extract_structured`, abstractive
|
|
62
|
+
`summarize_content`, `deep_research`, `agent`) resolve a provider in this order:
|
|
63
|
+
|
|
64
|
+
```
|
|
65
|
+
Ollama (local, default, no key) → OpenAI key → Anthropic key → MCP Sampling → raw evidence
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
The server ships with **no cloud LLM key by design**. With none configured,
|
|
69
|
+
research tools return raw evidence for the calling assistant to synthesize.
|
|
70
|
+
Do not suggest adding API keys — local Ollama is the intended zero-cost default.
|
|
71
|
+
|
|
72
|
+
## When a tool fails — fallbacks
|
|
73
|
+
|
|
74
|
+
| Symptom | Try next |
|
|
75
|
+
|---------|----------|
|
|
76
|
+
| `scrape` / `fetch_url` returns 403, 429, CAPTCHA, or empty JS shell | `stealth_mode` (crawlforge-stealth-browsing) |
|
|
77
|
+
| No template for a known site | `scrape_structured` → `extract_structured` → `extract_with_llm` |
|
|
78
|
+
| LLM extraction unavailable (no Ollama/keys) | `scrape_structured` with CSS selectors |
|
|
79
|
+
| Single page too slow / many pages | `batch_scrape` (async + webhook) |
|
|
80
|
+
| Wrong region / currency shown | `localization` |
|
|
81
|
+
| Need a big report but cost is high | lower `maxUrls` on `deep_research` |
|
|
82
|
+
|
|
83
|
+
## Credits
|
|
84
|
+
|
|
85
|
+
Costs range from 1 credit (fetch_url, extract_text, scrape_template,
|
|
86
|
+
get_batch_results...) up to 10+ for `deep_research`. Full table:
|
|
87
|
+
[credits](references/credits.md). Prefer the cheapest tool that does the job, and
|
|
88
|
+
cap dynamic tools (`deep_research`, `agent`, `crawl_deep`) with their max
|
|
89
|
+
parameters.
|
|
@@ -0,0 +1,71 @@
|
|
|
1
|
+
# CrawlForge CLI — Subcommand Summary
|
|
2
|
+
|
|
3
|
+
The `crawlforge` CLI exposes the tools as command-line subcommands for scripts,
|
|
4
|
+
CI, and pipelines.
|
|
5
|
+
|
|
6
|
+
## Install & auth
|
|
7
|
+
|
|
8
|
+
```bash
|
|
9
|
+
npm install -g crawlforge-mcp-server # or: npx crawlforge-mcp-server <command>
|
|
10
|
+
export CRAWLFORGE_API_KEY="your_api_key_here" # or pass --api-key per command
|
|
11
|
+
crawlforge init # detect key, install skills, merge MCP config
|
|
12
|
+
```
|
|
13
|
+
|
|
14
|
+
## Global flags
|
|
15
|
+
|
|
16
|
+
| Flag | Description |
|
|
17
|
+
|------|-------------|
|
|
18
|
+
| `--json` | Compact JSON output (pipe-friendly). |
|
|
19
|
+
| `--pretty` | Pretty-printed JSON. |
|
|
20
|
+
| `--quiet` | Suppress stdout (exit code only). |
|
|
21
|
+
| `--api-key <key>` | Override `CRAWLFORGE_API_KEY`. |
|
|
22
|
+
| `--timeout <ms>` | Request timeout (default 30000). |
|
|
23
|
+
| `--version` / `--help` | Version / help. |
|
|
24
|
+
|
|
25
|
+
## Tool commands (15)
|
|
26
|
+
|
|
27
|
+
| Command | Maps to | Example |
|
|
28
|
+
|---------|---------|---------|
|
|
29
|
+
| `scrape <url>` | fetch_url / extract_content | `crawlforge scrape https://example.com --extract --format markdown` |
|
|
30
|
+
| `search <query>` | search_web | `crawlforge search "MCP tutorial" --limit 5` |
|
|
31
|
+
| `crawl <url>` | crawl_deep | `crawlforge crawl https://docs.example.com --depth 3 --max-pages 200` |
|
|
32
|
+
| `map <url>` | map_site | `crawlforge map https://example.com --format xml` |
|
|
33
|
+
| `extract <url>` | extract_structured / extract_with_llm | `crawlforge extract <url> --prompt "title, author" --model llama3.2` |
|
|
34
|
+
| `track <url>` | track_changes | `crawlforge track <url> --selector ".price" --threshold 1` |
|
|
35
|
+
| `analyze <url>` | analyze_content | `crawlforge analyze <url> --depth full` |
|
|
36
|
+
| `research <topic>` | deep_research | `crawlforge research "AI trends 2025" --depth deep --max-urls 30` |
|
|
37
|
+
| `stealth <url>` | stealth_mode | `crawlforge stealth <url> --engine camoufox --screenshot` |
|
|
38
|
+
| `batch <file>` | batch_scrape | `crawlforge batch urls.txt --format markdown --concurrency 10` |
|
|
39
|
+
| `actions <url>` | scrape_with_actions | `crawlforge actions <url> --script flow.json --screenshot` |
|
|
40
|
+
| `localize <url>` | localization | `crawlforge localize <url> --locale fr-FR --country FR` |
|
|
41
|
+
| `llmstxt <url>` | generate_llms_txt | `crawlforge llmstxt <url> --include-full` |
|
|
42
|
+
| `template <id> <target>` | scrape_template | `crawlforge template github-repo https://github.com/owner/repo` |
|
|
43
|
+
| `monitor <url>` | track_changes (scheduled) | `crawlforge monitor <url> --interval 60 --webhook <url>` |
|
|
44
|
+
|
|
45
|
+
## Skill commands (2)
|
|
46
|
+
|
|
47
|
+
| Command | Purpose |
|
|
48
|
+
|---------|---------|
|
|
49
|
+
| `install-skills --target <claude-code\|cursor\|vscode\|all>` | Install skill files into your AI tool. |
|
|
50
|
+
| `uninstall-skills --target <...>` | Remove them. |
|
|
51
|
+
|
|
52
|
+
## Output modes
|
|
53
|
+
|
|
54
|
+
```bash
|
|
55
|
+
crawlforge search "nodejs MCP" # human-readable
|
|
56
|
+
crawlforge research "AI 2025" --pretty > report.json # pretty JSON to file
|
|
57
|
+
crawlforge search "nodejs" --json | jq '.results[].url' # compact JSON to jq
|
|
58
|
+
crawlforge scrape https://example.com --quiet && echo ok # exit code only
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
## Relevant env vars
|
|
62
|
+
|
|
63
|
+
| Variable | Purpose |
|
|
64
|
+
|----------|---------|
|
|
65
|
+
| `CRAWLFORGE_API_KEY` | API key (required for production). |
|
|
66
|
+
| `OLLAMA_BASE_URL` | Ollama server (default http://localhost:11434). |
|
|
67
|
+
| `OLLAMA_DEFAULT_MODEL` | Default model for `extract`. |
|
|
68
|
+
| `CRAWLFORGE_STEALTH_ENGINE` | Force `playwright` or `camoufox`. |
|
|
69
|
+
| `CRAWLFORGE_BROWSER_BACKEND` | `local` or `browserbase`. |
|
|
70
|
+
|
|
71
|
+
Exit code 0 = success, 1 = error. `crawlforge <command> --help` for per-command help.
|
|
@@ -0,0 +1,75 @@
|
|
|
1
|
+
# CrawlForge Credit Costs
|
|
2
|
+
|
|
3
|
+
Per-tool credit cost (source of truth: `AuthManager.getToolCost`). Every tool is
|
|
4
|
+
metered; there is no free tier. Tools marked "scales" cost more as work grows.
|
|
5
|
+
|
|
6
|
+
## 1 credit
|
|
7
|
+
|
|
8
|
+
| Tool | Notes |
|
|
9
|
+
|------|-------|
|
|
10
|
+
| `fetch_url` | Raw HTTP fetch. |
|
|
11
|
+
| `extract_text` | Tag-stripped text/markdown. |
|
|
12
|
+
| `extract_links` | All links on a page. |
|
|
13
|
+
| `extract_metadata` | Title / meta / OG / schema.org. |
|
|
14
|
+
| `scrape_template` | Pre-built site extractor. |
|
|
15
|
+
| `list_ollama_models` | List local LLMs. |
|
|
16
|
+
| `get_batch_results` | Retrieve an already-paid batch. |
|
|
17
|
+
|
|
18
|
+
## 2 credits
|
|
19
|
+
|
|
20
|
+
| Tool | Notes |
|
|
21
|
+
|------|-------|
|
|
22
|
+
| `scrape` | Unified multi-format single fetch. |
|
|
23
|
+
| `scrape_structured` | CSS-selector extraction. |
|
|
24
|
+
| `extract_content` | Readability-cleaned article. |
|
|
25
|
+
| `map_site` | URL discovery / sitemap. |
|
|
26
|
+
| `process_document` | PDF / DOCX / TXT parsing. |
|
|
27
|
+
| `localization` | Locale / geo emulation. |
|
|
28
|
+
|
|
29
|
+
## 3 credits
|
|
30
|
+
|
|
31
|
+
| Tool | Notes |
|
|
32
|
+
|------|-------|
|
|
33
|
+
| `track_changes` | Baseline / compare / monitor. |
|
|
34
|
+
| `analyze_content` | Sentiment / entities / keywords. |
|
|
35
|
+
| `extract_structured` | Schema-driven (LLM + CSS fallback). |
|
|
36
|
+
| `extract_with_llm` | NL-prompt extraction. |
|
|
37
|
+
|
|
38
|
+
## 4 credits
|
|
39
|
+
|
|
40
|
+
| Tool | Notes |
|
|
41
|
+
|------|-------|
|
|
42
|
+
| `summarize_content` | Extractive / abstractive summary. |
|
|
43
|
+
| `crawl_deep` | Site crawl; **scales** with page count. |
|
|
44
|
+
|
|
45
|
+
## 5 credits
|
|
46
|
+
|
|
47
|
+
| Tool | Notes |
|
|
48
|
+
|------|-------|
|
|
49
|
+
| `stealth_mode` | Anti-bot browser (screenshots add a little). |
|
|
50
|
+
| `scrape_with_actions` | Browser automation then scrape. |
|
|
51
|
+
| `batch_scrape` | Many URLs; projection scales with URL count. |
|
|
52
|
+
| `search_web` | Web search. |
|
|
53
|
+
| `generate_llms_txt` | AI-compliance file. |
|
|
54
|
+
|
|
55
|
+
## 8 credits
|
|
56
|
+
|
|
57
|
+
| Tool | Notes |
|
|
58
|
+
|------|-------|
|
|
59
|
+
| `agent` | Autonomous research; **scales** with `maxUrls` (and `model:"pro"`). |
|
|
60
|
+
|
|
61
|
+
## 10+ credits
|
|
62
|
+
|
|
63
|
+
| Tool | Notes |
|
|
64
|
+
|------|-------|
|
|
65
|
+
| `deep_research` | Base 10; **scales** with `maxUrls`. **>50 URLs prompts for confirmation.** |
|
|
66
|
+
|
|
67
|
+
## Cost-control tips
|
|
68
|
+
|
|
69
|
+
- Use the cheapest tool that satisfies the request (e.g. `extract_metadata` at 1
|
|
70
|
+
before `scrape` at 2).
|
|
71
|
+
- Prefer one `scrape` call (2) over several single-format basic calls.
|
|
72
|
+
- Cap dynamic tools: `deep_research`/`agent` via `maxUrls`, `crawl_deep` via
|
|
73
|
+
`max_pages`, `batch_scrape` via the URL list size.
|
|
74
|
+
- `get_batch_results` (1) is cheap — submit a batch once, page through results.
|
|
75
|
+
- Errors are charged at half the tool cost; creator mode is unlimited.
|
|
@@ -0,0 +1,106 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: crawlforge-stealth-browsing
|
|
3
|
+
description: "Bypasses bot detection and geo-restrictions with CrawlForge's stealth_mode and localization tools. Use when a site returns 403 or 429, CAPTCHAs, 'please enable JavaScript', or empty content, or is protected by Cloudflare, DataDome, or PerimeterX, or when the user needs region-specific pricing, geo-blocked content, or a specific locale, timezone, or currency. stealth_mode runs a stealth browser (playwright by default, camoufox for advanced fingerprinting) and can screenshot; localization emulates a country and language. Explains when to escalate from a normal scrape to stealth."
|
|
4
|
+
metadata:
|
|
5
|
+
version: 4.8.0
|
|
6
|
+
source: crawlforge-mcp-server
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# CrawlForge Stealth Browsing
|
|
10
|
+
|
|
11
|
+
Get past bot-detection systems and geo-blocks. Use `stealth_mode` when a normal
|
|
12
|
+
scrape is blocked, and `localization` when you need region-specific content,
|
|
13
|
+
pricing, or locale emulation.
|
|
14
|
+
|
|
15
|
+
## When to escalate to stealth_mode
|
|
16
|
+
|
|
17
|
+
Escalate from a normal `scrape` / `fetch_url` (see crawlforge-web-scraping) when:
|
|
18
|
+
|
|
19
|
+
- The site returns **403 or 429** on a regular fetch.
|
|
20
|
+
- You get a **CAPTCHA** or a "please enable JavaScript" interstitial.
|
|
21
|
+
- Content comes back **empty** or only a shell (JS-rendered SPA).
|
|
22
|
+
- The site uses **Cloudflare, DataDome, PerimeterX**, or similar protection.
|
|
23
|
+
|
|
24
|
+
`stealth_mode` drives a real browser with randomized fingerprints, human
|
|
25
|
+
behavior simulation, and WebRTC/canvas/WebGL spoofing.
|
|
26
|
+
|
|
27
|
+
## stealth_mode (cost: 5)
|
|
28
|
+
|
|
29
|
+
`stealth_mode` is operation-based. Typical flow: create a context, then create a
|
|
30
|
+
page that navigates to the target URL.
|
|
31
|
+
|
|
32
|
+
```json
|
|
33
|
+
{
|
|
34
|
+
"tool": "stealth_mode",
|
|
35
|
+
"params": {
|
|
36
|
+
"operation": "create_context",
|
|
37
|
+
"stealthConfig": { "level": "advanced", "simulateHumanBehavior": true },
|
|
38
|
+
"engine": "playwright"
|
|
39
|
+
}
|
|
40
|
+
}
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
Then use the returned `contextId`:
|
|
44
|
+
|
|
45
|
+
```json
|
|
46
|
+
{
|
|
47
|
+
"tool": "stealth_mode",
|
|
48
|
+
"params": { "operation": "create_page", "contextId": "<id-from-create_context>", "urlToTest": "https://protected-site.com" }
|
|
49
|
+
}
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
Operations: `configure`, `enable`, `disable`, `create_context`, `create_page`,
|
|
53
|
+
`get_stats`, `cleanup`. `stealthConfig.level` is `basic` / `medium` (default) /
|
|
54
|
+
`advanced`. Always run `cleanup` when done to release the browser.
|
|
55
|
+
|
|
56
|
+
### Engine: playwright vs camoufox
|
|
57
|
+
|
|
58
|
+
- `engine:"playwright"` (default) — Chromium with stealth patches. Fast, good
|
|
59
|
+
for most basic bot detection.
|
|
60
|
+
- `engine:"camoufox"` — Firefox-based with native anti-detection (no patches).
|
|
61
|
+
Scores higher against DataDome / Cloudflare / PerimeterX and on CreepJS. Use
|
|
62
|
+
for heavily protected, financial, or e-commerce sites.
|
|
63
|
+
|
|
64
|
+
Full decision table: [engine selection](references/engine-selection.md).
|
|
65
|
+
|
|
66
|
+
### CLI
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
crawlforge stealth https://protected-site.com
|
|
70
|
+
crawlforge stealth https://protected-site.com --engine camoufox --wait 3000 --screenshot
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
The CLI exposes a one-shot form (`--engine`, `--wait <ms>`, `--screenshot`).
|
|
74
|
+
Force the engine globally with `export CRAWLFORGE_STEALTH_ENGINE=camoufox`.
|
|
75
|
+
|
|
76
|
+
## localization (cost: 2)
|
|
77
|
+
|
|
78
|
+
Emulate a country/language/timezone/currency for region-specific content and
|
|
79
|
+
geo-blocked pages.
|
|
80
|
+
|
|
81
|
+
```json
|
|
82
|
+
{
|
|
83
|
+
"tool": "localization",
|
|
84
|
+
"params": { "operation": "configure_country", "countryCode": "DE", "language": "de", "currency": "EUR" }
|
|
85
|
+
}
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
Operations: `configure_country`, `localize_search`, `localize_browser`,
|
|
89
|
+
`generate_timezone_spoof`, `handle_geo_blocking`, `auto_detect`, `get_stats`,
|
|
90
|
+
`get_supported_countries`. `countryCode` is ISO 3166-1 alpha-2; `currency` is
|
|
91
|
+
ISO 4217. Supports proxy routing and GPS geolocation emulation.
|
|
92
|
+
|
|
93
|
+
CLI: `crawlforge localize https://shop.example.com --locale en-GB --country GB --currency GBP`.
|
|
94
|
+
|
|
95
|
+
## stealth_mode vs localization
|
|
96
|
+
|
|
97
|
+
- **Blocked / bot-detected** → `stealth_mode`.
|
|
98
|
+
- **Wrong region / language / currency, but not blocked** → `localization`.
|
|
99
|
+
- **Geo-blocked AND bot-protected** → `localization` to set region context, then
|
|
100
|
+
`stealth_mode` for the actual fetch.
|
|
101
|
+
|
|
102
|
+
## Cost note
|
|
103
|
+
|
|
104
|
+
`stealth_mode` = 5 credits per call (screenshots add a small extra cost);
|
|
105
|
+
`localization` = 2 credits. Try the cheaper `scrape` (2 credits) first and only
|
|
106
|
+
escalate to stealth when you see the block signals above.
|