crawlforge-mcp-server 4.7.1 → 4.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (53) hide show
  1. package/CLAUDE.md +2 -2
  2. package/package.json +2 -1
  3. package/server.js +56 -10
  4. package/src/cli/commands/init.js +13 -2
  5. package/src/cli/commands/install-skills.js +10 -1
  6. package/src/cli/commands/monitor.js +81 -0
  7. package/src/cli/commands/uninstall-skills.js +10 -1
  8. package/src/core/ActionExecutor.js +81 -15
  9. package/src/core/ElicitationHelper.js +18 -5
  10. package/src/core/LLMsTxtAnalyzer.js +2 -1
  11. package/src/core/MonitorScheduler.js +281 -0
  12. package/src/core/MonitorStore.js +79 -0
  13. package/src/core/ResearchOrchestrator.js +2 -1
  14. package/src/core/crawlers/BFSCrawler.js +2 -1
  15. package/src/resources/ResourceRegistry.js +3 -0
  16. package/src/skills/agent-skills/crawlforge-batch-automation/SKILL.md +126 -0
  17. package/src/skills/agent-skills/crawlforge-batch-automation/references/actions.md +127 -0
  18. package/src/skills/agent-skills/crawlforge-change-tracking/SKILL.md +116 -0
  19. package/src/skills/agent-skills/crawlforge-deep-research/SKILL.md +108 -0
  20. package/src/skills/agent-skills/crawlforge-deep-research/references/workflows.md +76 -0
  21. package/src/skills/agent-skills/crawlforge-getting-started/SKILL.md +89 -0
  22. package/src/skills/agent-skills/crawlforge-getting-started/references/cli.md +71 -0
  23. package/src/skills/agent-skills/crawlforge-getting-started/references/credits.md +75 -0
  24. package/src/skills/agent-skills/crawlforge-stealth-browsing/SKILL.md +106 -0
  25. package/src/skills/agent-skills/crawlforge-stealth-browsing/references/engine-selection.md +63 -0
  26. package/src/skills/agent-skills/crawlforge-structured-extraction/SKILL.md +121 -0
  27. package/src/skills/agent-skills/crawlforge-structured-extraction/references/templates.md +39 -0
  28. package/src/skills/agent-skills/crawlforge-web-scraping/SKILL.md +141 -0
  29. package/src/skills/agent-skills/crawlforge-web-scraping/references/tool-reference.md +95 -0
  30. package/src/skills/installer.js +186 -34
  31. package/src/tools/advanced/ScrapeWithActionsTool.js +7 -0
  32. package/src/tools/advanced/batchScrape/worker.js +8 -2
  33. package/src/tools/basic/_fetch.js +14 -1
  34. package/src/tools/crawl/_sessionContext.js +3 -1
  35. package/src/tools/extract/_fetchAndParse.js +2 -1
  36. package/src/tools/extract/extractContent.js +2 -1
  37. package/src/tools/extract/extractStructured.js +43 -0
  38. package/src/tools/extract/processDocument.js +2 -1
  39. package/src/tools/scrape/_brandingExtractor.js +378 -0
  40. package/src/tools/scrape/unifiedScrape.js +66 -6
  41. package/src/tools/templates/ScrapeTemplateTool.js +2 -1
  42. package/src/tools/tracking/trackChanges/differ.js +3 -1
  43. package/src/tools/tracking/trackChanges/index.js +74 -21
  44. package/src/tools/tracking/trackChanges/schema.js +7 -2
  45. package/src/utils/hostRateLimiter.js +46 -0
  46. package/src/utils/robotsChecker.js +2 -1
  47. package/src/utils/sitemapParser.js +2 -1
  48. package/src/utils/ssrfGuard.js +161 -0
  49. package/src/utils/ssrfProtection.js +6 -9
  50. package/src/skills/crawlforge-cli.md +0 -157
  51. package/src/skills/crawlforge-mcp.md +0 -80
  52. package/src/skills/crawlforge-research.md +0 -104
  53. package/src/skills/crawlforge-stealth.md +0 -98
@@ -0,0 +1,116 @@
1
+ ---
2
+ name: crawlforge-change-tracking
3
+ description: "Monitors web pages for changes over time with CrawlForge's track_changes tool. Use when the user wants to track changes to a page, watch a URL, monitor competitor pricing, detect when content updates, get notified of regulation or product-availability changes, or diff a page against a saved baseline. Workflow: create a baseline with operation create_baseline, then periodically compare with operation compare to get a change percentage and a diff; supports CSS-selector scoping, webhooks, and scheduled monitoring."
4
+ metadata:
5
+ version: 4.8.0
6
+ source: crawlforge-mcp-server
7
+ ---
8
+
9
+ # CrawlForge Change Tracking
10
+
11
+ Detect when a web page changes over time using the `track_changes` tool. Useful
12
+ for competitor pricing, regulation updates, product availability, and any page
13
+ you need to watch for edits.
14
+
15
+ ## When to use
16
+
17
+ - "Track changes to this page" / "watch this URL"
18
+ - "Tell me when competitor pricing changes"
19
+ - "Detect when this content updates"
20
+ - "Diff this page against last week's version"
21
+ - "Notify me when product availability / a regulation changes"
22
+
23
+ ## Core workflow
24
+
25
+ `track_changes` is one tool driven by an `operation` parameter (cost: 3 credits
26
+ per call).
27
+
28
+ 1. **Create a baseline** the first time:
29
+
30
+ ```json
31
+ { "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "create_baseline" } }
32
+ ```
33
+
34
+ 2. **Compare** later to get the change percentage + diff:
35
+
36
+ ```json
37
+ { "tool": "track_changes", "params": { "url": "https://example.com/pricing", "operation": "compare" } }
38
+ ```
39
+
40
+ `compare` is the default operation. It returns a change percentage and a
41
+ structured diff against the stored baseline.
42
+
43
+ ## Scoping to part of a page
44
+
45
+ Use `trackingOptions` to ignore noise and focus on what matters:
46
+
47
+ ```json
48
+ {
49
+ "tool": "track_changes",
50
+ "params": {
51
+ "url": "https://example.com/pricing",
52
+ "operation": "compare",
53
+ "trackingOptions": {
54
+ "granularity": "element",
55
+ "customSelectors": [".price", ".plan-name"],
56
+ "excludeSelectors": [".timestamp", ".ad"],
57
+ "ignoreWhitespace": true
58
+ }
59
+ }
60
+ }
61
+ ```
62
+
63
+ `granularity`: `page`, `section` (default), `element`, `text`. Toggle
64
+ `trackText`, `trackStructure`, `trackLinks`, `trackImages`. Set
65
+ `significanceThresholds` (`minor`/`moderate`/`major`) to classify change size.
66
+
67
+ CLI: `crawlforge track https://example.com --selector ".price" --threshold 1`.
68
+
69
+ ## Scheduled monitoring & notifications
70
+
71
+ Run continuous monitoring with webhooks instead of polling manually:
72
+
73
+ ```json
74
+ {
75
+ "tool": "track_changes",
76
+ "params": {
77
+ "url": "https://example.com/pricing",
78
+ "operation": "create_scheduled_monitor",
79
+ "monitoringOptions": {
80
+ "enabled": true,
81
+ "interval": 300000,
82
+ "notificationThreshold": "moderate",
83
+ "enableWebhook": true,
84
+ "webhookUrl": "https://my-site.com/notify"
85
+ }
86
+ }
87
+ }
88
+ ```
89
+
90
+ CLI (runs until Ctrl+C):
91
+ `crawlforge monitor https://example.com --interval 60 --webhook https://my-site.com/notify`.
92
+
93
+ ## Other operations
94
+
95
+ | Operation | Purpose |
96
+ |-----------|---------|
97
+ | `create_baseline` | Save the first snapshot to diff against. |
98
+ | `compare` (default) | Diff current content vs baseline → % change + diff. |
99
+ | `monitor` | One monitoring pass. |
100
+ | `get_history` | Retrieve past change records (`queryOptions`). |
101
+ | `get_stats` | Summary statistics for a tracked URL. |
102
+ | `create_scheduled_monitor` / `stop_scheduled_monitor` | Manage cron-style monitors. |
103
+ | `get_dashboard` | Aggregate status, recent alerts, trends. |
104
+ | `export_history` | Export change history as `json` or `csv`. |
105
+ | `create_alert_rule` | Conditional alerts (webhook / email / slack). |
106
+ | `generate_trend_report` | Trend analysis over time. |
107
+ | `get_monitoring_templates` | List built-in monitoring presets. |
108
+
109
+ You can also pass `content` or `html` directly to compare pre-fetched content
110
+ against the baseline without re-fetching.
111
+
112
+ ## Cost note
113
+
114
+ `track_changes` = 3 credits per call. A typical watch is one `create_baseline`
115
+ plus periodic `compare` calls; scheduled monitors run server-side and notify via
116
+ webhook/Slack so you don't pay for manual polling loops.
@@ -0,0 +1,108 @@
1
+ ---
2
+ name: crawlforge-deep-research
3
+ description: "Runs multi-source web research and autonomous question-answering with CrawlForge's deep_research, agent, and search_web tools. Use when the user wants to research a topic, do a deep dive, compare competitors, gather facts with citations, answer a question from the web, search the web, or get a synthesized report from many sources. deep_research expands queries, fetches and dedupes sources, then synthesizes; agent autonomously plans searches and navigation from a plain-English prompt with no URLs needed; search_web returns ranked results. Caps costs via max_urls and confirms before expensive runs."
4
+ metadata:
5
+ version: 4.8.0
6
+ source: crawlforge-mcp-server
7
+ ---
8
+
9
+ # CrawlForge Deep Research
10
+
11
+ Answer questions and produce reports from many web sources using the CrawlForge
12
+ MCP server. Three tools, from lightest to heaviest: `search_web` (ranked
13
+ results), `agent` (autonomous plan-and-answer), `deep_research` (exhaustive
14
+ multi-source synthesis).
15
+
16
+ ## When to use
17
+
18
+ - "Search the web for X" / "find pages about X" → `search_web`
19
+ - "Answer this question from the web" / "what are the top 5 X" (no URLs given) → `agent`
20
+ - "Research this topic in depth" / "compare competitors" / "give me a cited
21
+ report" / "gather facts with sources" → `deep_research`
22
+
23
+ Do NOT use these for a single known URL — use the **crawlforge-web-scraping**
24
+ skill (`scrape` / `extract_content`) instead.
25
+
26
+ ## Tool selection
27
+
28
+ | Need | Tool | Cost |
29
+ |------|------|------|
30
+ | A ranked list of result URLs + snippets | `search_web` | 5 |
31
+ | A direct answer, agent decides what to read | `agent` | 8 (scales) |
32
+ | A synthesized, multi-source, cited report | `deep_research` | 10+ (scales) |
33
+
34
+ A common pipeline: `search_web` to find sources → `batch_scrape` (see
35
+ **crawlforge-batch-automation**) to read them → `deep_research` to synthesize.
36
+
37
+ ## search_web (cost: 5)
38
+
39
+ ```json
40
+ {
41
+ "tool": "search_web",
42
+ "params": { "query": "best MCP servers 2025", "limit": 10, "time_range": "month" }
43
+ }
44
+ ```
45
+
46
+ Returns titles, URLs, snippets. Supports `lang`, `site` (domain filter),
47
+ `file_type`, `time_range` (`day`/`week`/`month`/`year`/`all`), `expand_query`,
48
+ `enable_ranking`, and `enable_deduplication`. CLI:
49
+ `crawlforge search "MCP server tutorial" --limit 5`.
50
+
51
+ ## agent — autonomous answer (cost: 8, scales with maxUrls)
52
+
53
+ No URLs required. The agent plans search queries, fetches and filters relevant
54
+ pages, then returns a prose or structured answer.
55
+
56
+ ```json
57
+ {
58
+ "tool": "agent",
59
+ "params": { "prompt": "What are the top 5 MCP servers in 2025 and who maintains them?", "maxUrls": 10 }
60
+ }
61
+ ```
62
+
63
+ - `model:"default"` runs a SamplingClient loop (no LLM keys needed).
64
+ - `model:"pro"` runs the full ResearchOrchestrator (deeper, multi-source); it
65
+ asks for confirmation (elicitation) before running.
66
+ - Pass `schema` for structured JSON output, `urls` for optional seed URLs.
67
+ - Hard limits enforced by the orchestrator: `maxSteps` ≤ 10, `maxUrls` ≤ 20,
68
+ 120s wall-clock. Output is degraded-but-useful if no LLM keys/Ollama exist.
69
+
70
+ ## deep_research — exhaustive report (cost: 10+, scales)
71
+
72
+ ```json
73
+ {
74
+ "tool": "deep_research",
75
+ "params": {
76
+ "topic": "React vs Vue vs Angular in 2025",
77
+ "maxUrls": 20,
78
+ "researchApproach": "comparative",
79
+ "outputFormat": "comprehensive"
80
+ }
81
+ }
82
+ ```
83
+
84
+ How it works: query expansion → parallel fetch (deduped) → content extraction →
85
+ conflict detection → synthesis. `researchApproach`: `broad`, `focused`,
86
+ `academic`, `current_events`, `comparative`. `outputFormat`: `comprehensive`,
87
+ `summary`, `citations_only`, `conflicts_focus`. Results are stored as a
88
+ `crawlforge://research/{sessionId}` resource — re-read them without re-running.
89
+
90
+ CLI: `crawlforge research "B2B SaaS pricing trends" --depth deep --max-urls 30`.
91
+
92
+ ### Cost control
93
+
94
+ `deep_research` starts at 10 credits and grows with sources. **`maxUrls > 50`
95
+ triggers a confirmation prompt (elicitation).** Cap cost with `maxUrls`
96
+ (e.g. `maxUrls:10` ≈ 12 credits). `timeLimit` (default 120000 ms) bounds
97
+ runtime.
98
+
99
+ ## Synthesis & LLM fallback
100
+
101
+ The server ships with **no LLM key by design**. The fallback chain is:
102
+ Ollama (local) → OpenAI key → Anthropic key → MCP Sampling → raw evidence.
103
+ With no LLM configured, `deep_research` returns structured **raw evidence** for
104
+ the calling assistant (e.g. Claude Code) to synthesize — this is expected, do
105
+ not suggest adding API keys.
106
+
107
+ See [research workflows](references/workflows.md) for pipelines, depth tiers,
108
+ and parameter detail.
@@ -0,0 +1,76 @@
1
+ # Deep Research — Workflows & Parameters
2
+
3
+ ## Pipelines
4
+
5
+ ### Targeted competitive research
6
+ 1. `search_web` `"competitor X pricing"` → candidate URLs.
7
+ 2. `batch_scrape` the URLs → content in parallel (see crawlforge-batch-automation).
8
+ 3. `deep_research` `"competitor X vs us"` → synthesized comparison.
9
+
10
+ ### Quick fact-find
11
+ - `agent` with `model:"default"` and a tight `maxUrls` (5–8). Cheapest path to a
12
+ cited answer when you don't need a full report.
13
+
14
+ ### Exhaustive report
15
+ - `deep_research` with `researchApproach` matched to intent and `maxUrls`
16
+ sized to budget.
17
+
18
+ ## deep_research parameters (cost: 10+, scales with maxUrls)
19
+
20
+ | Param | Type | Default | Notes |
21
+ |-------|------|---------|-------|
22
+ | `topic` | string | — | Required, 3–500 chars. |
23
+ | `maxDepth` | number | `5` | 1–10. |
24
+ | `maxUrls` | number | `50` | 1–1000. **>50 triggers elicitation.** |
25
+ | `timeLimit` | number | `120000` | 30000–300000 ms. |
26
+ | `researchApproach` | enum | `broad` | `broad`, `focused`, `academic`, `current_events`, `comparative`. |
27
+ | `sourceTypes` | string[] | `["any"]` | `academic`, `news`, `government`, `commercial`, `blog`, `wiki`, `any`. |
28
+ | `credibilityThreshold` | number | `0.3` | 0–1 minimum source credibility. |
29
+ | `enableConflictDetection` | boolean | `true` | Flag contradictory claims. |
30
+ | `enableSourceVerification` | boolean | `true` | Score source credibility. |
31
+ | `enableSynthesis` | boolean | `true` | Build a coherent report. |
32
+ | `outputFormat` | enum | `comprehensive` | `comprehensive`, `summary`, `citations_only`, `conflicts_focus`. |
33
+ | `includeRawData` | boolean | `false` | Include raw scraped data. |
34
+ | `concurrency` | number | `5` | 1–20. |
35
+ | `webhook` | object | — | `started`/`progress`/`completed`/`failed` callbacks. |
36
+
37
+ Results stored at `crawlforge://research/{sessionId}` (list via `resources/list`).
38
+
39
+ ## Approximate depth tiers (CLI `--depth`)
40
+
41
+ | Depth | URLs | Use case | Approx. credits |
42
+ |-------|------|----------|-----------------|
43
+ | basic | 5–10 | Quick overview | 10–15 |
44
+ | standard | 15–25 | General research | 15–30 |
45
+ | deep | 30–75 | Comprehensive analysis | 30–75+ |
46
+
47
+ ## agent parameters (cost: 8, scales with maxUrls)
48
+
49
+ | Param | Type | Default | Notes |
50
+ |-------|------|---------|-------|
51
+ | `prompt` | string | — | Required, 1–2000 chars. |
52
+ | `urls` | string[] | — | Optional seed URLs, max 20. |
53
+ | `schema` | object | — | JSON schema for structured output. |
54
+ | `model` | enum | `default` | `default` = SamplingClient loop; `pro` = ResearchOrchestrator (confirms first). |
55
+ | `maxSteps` | number | `5` | Hard cap 10. |
56
+ | `maxUrls` | number | `10` | Hard cap 20. |
57
+
58
+ Orchestrator-enforced hard stops (never delegated to the LLM): maxSteps ≤ 10,
59
+ maxUrls ≤ 20, 120s wall-clock.
60
+
61
+ ## search_web parameters (cost: 5)
62
+
63
+ | Param | Type | Notes |
64
+ |-------|------|-------|
65
+ | `query` | string | Required. |
66
+ | `limit` | number | 1–100. |
67
+ | `offset` | number | Pagination. |
68
+ | `lang` | string | e.g. `en`, `fr`. |
69
+ | `time_range` | enum | `day`, `week`, `month`, `year`, `all`. |
70
+ | `site` | string | Restrict to a domain. |
71
+ | `file_type` | string | e.g. `pdf`, `doc`. |
72
+ | `provider` | enum | `crawlforge` (default) or `searxng`. |
73
+ | `expand_query` | boolean | Synonyms/stemming/etc. |
74
+ | `enable_ranking` | boolean | BM25 + signal re-ranking. |
75
+ | `enable_deduplication` | boolean | Drop near-duplicates. |
76
+ | `localization` | object | `countryCode`, `language`, `timezone`, geo-targeting. |
@@ -0,0 +1,89 @@
1
+ ---
2
+ name: crawlforge-getting-started
3
+ description: "Orientation and tool-selection guide for the CrawlForge MCP server's 26 web tools. Use when the user is getting started with CrawlForge, asks which CrawlForge tool to use, how to set up the API key, how skills or the CLI work, what a tool costs in credits, or when one tool fails and a fallback is needed. Routes requests to the right specialized skill (web scraping, deep research, stealth, structured extraction, change tracking, batch automation), and explains MCP-tools-vs-CLI, the Ollama-first LLM fallback chain, and per-tool credit costs."
4
+ metadata:
5
+ version: 4.8.0
6
+ source: crawlforge-mcp-server
7
+ ---
8
+
9
+ # CrawlForge: Getting Started
10
+
11
+ CrawlForge is an MCP server with **26 tools** for web scraping, crawling,
12
+ extraction, research, change tracking, and AI-compliance. This skill orients you
13
+ and routes each request to the right specialized skill.
14
+
15
+ ## Setup
16
+
17
+ 1. Get an API key at https://crawlforge.dev/signup (1,000 free credits).
18
+ 2. Provide it to the server:
19
+
20
+ ```bash
21
+ export CRAWLFORGE_API_KEY="your_api_key_here"
22
+ # or, for the CLI, run: crawlforge init (detects key, installs skills, merges MCP config)
23
+ ```
24
+
25
+ Every tool is metered and requires a key — there is no free tier. User config is
26
+ stored at `~/.crawlforge/config.json`.
27
+
28
+ ## Route to the right skill
29
+
30
+ | The user wants to... | Use skill |
31
+ |----------------------|-----------|
32
+ | Scrape a page, get markdown/text/links/metadata, map or crawl a site | **crawlforge-web-scraping** |
33
+ | Research a topic, search the web, get a cited report, autonomous Q&A | **crawlforge-deep-research** |
34
+ | Get past 403/CAPTCHA/Cloudflare, or emulate a region/locale | **crawlforge-stealth-browsing** |
35
+ | Extract JSON/fields, parse a PDF, summarize, analyze sentiment | **crawlforge-structured-extraction** |
36
+ | Watch a page for changes / monitor pricing | **crawlforge-change-tracking** |
37
+ | Scrape many URLs, run browser actions, generate llms.txt | **crawlforge-batch-automation** |
38
+
39
+ ## The 26 tools at a glance
40
+
41
+ - **Basic (5):** fetch_url, extract_text, extract_links, extract_metadata, scrape_structured
42
+ - **Unified (1):** scrape (multi-format single fetch)
43
+ - **Search & research (3):** search_web, deep_research, agent
44
+ - **Crawl (2):** crawl_deep, map_site
45
+ - **Extract & analyze (7):** extract_content, process_document, summarize_content, analyze_content, extract_structured, extract_with_llm, list_ollama_models
46
+ - **Batch & automation (4):** batch_scrape, get_batch_results, scrape_with_actions, generate_llms_txt
47
+ - **Stealth & locale (2):** stealth_mode, localization
48
+ - **Templates & tracking (2):** scrape_template, track_changes
49
+
50
+ ## MCP tools vs CLI
51
+
52
+ - **MCP tools** — call inline within an AI assistant session (Claude Code,
53
+ Cursor, etc.). This is the default in chat.
54
+ - **CLI** (`crawlforge <command>`) — for scripts, CI, and pipelines. 15 tool
55
+ commands + 2 skill commands. See [cli](references/cli.md).
56
+
57
+ Both hit the same backend and consume the same credits.
58
+
59
+ ## LLM fallback chain (Ollama-first)
60
+
61
+ LLM-backed tools (`extract_with_llm`, `extract_structured`, abstractive
62
+ `summarize_content`, `deep_research`, `agent`) resolve a provider in this order:
63
+
64
+ ```
65
+ Ollama (local, default, no key) → OpenAI key → Anthropic key → MCP Sampling → raw evidence
66
+ ```
67
+
68
+ The server ships with **no cloud LLM key by design**. With none configured,
69
+ research tools return raw evidence for the calling assistant to synthesize.
70
+ Do not suggest adding API keys — local Ollama is the intended zero-cost default.
71
+
72
+ ## When a tool fails — fallbacks
73
+
74
+ | Symptom | Try next |
75
+ |---------|----------|
76
+ | `scrape` / `fetch_url` returns 403, 429, CAPTCHA, or empty JS shell | `stealth_mode` (crawlforge-stealth-browsing) |
77
+ | No template for a known site | `scrape_structured` → `extract_structured` → `extract_with_llm` |
78
+ | LLM extraction unavailable (no Ollama/keys) | `scrape_structured` with CSS selectors |
79
+ | Single page too slow / many pages | `batch_scrape` (async + webhook) |
80
+ | Wrong region / currency shown | `localization` |
81
+ | Need a big report but cost is high | lower `maxUrls` on `deep_research` |
82
+
83
+ ## Credits
84
+
85
+ Costs range from 1 credit (fetch_url, extract_text, scrape_template,
86
+ get_batch_results...) up to 10+ for `deep_research`. Full table:
87
+ [credits](references/credits.md). Prefer the cheapest tool that does the job, and
88
+ cap dynamic tools (`deep_research`, `agent`, `crawl_deep`) with their max
89
+ parameters.
@@ -0,0 +1,71 @@
1
+ # CrawlForge CLI — Subcommand Summary
2
+
3
+ The `crawlforge` CLI exposes the tools as command-line subcommands for scripts,
4
+ CI, and pipelines.
5
+
6
+ ## Install & auth
7
+
8
+ ```bash
9
+ npm install -g crawlforge-mcp-server # or: npx crawlforge-mcp-server <command>
10
+ export CRAWLFORGE_API_KEY="your_api_key_here" # or pass --api-key per command
11
+ crawlforge init # detect key, install skills, merge MCP config
12
+ ```
13
+
14
+ ## Global flags
15
+
16
+ | Flag | Description |
17
+ |------|-------------|
18
+ | `--json` | Compact JSON output (pipe-friendly). |
19
+ | `--pretty` | Pretty-printed JSON. |
20
+ | `--quiet` | Suppress stdout (exit code only). |
21
+ | `--api-key <key>` | Override `CRAWLFORGE_API_KEY`. |
22
+ | `--timeout <ms>` | Request timeout (default 30000). |
23
+ | `--version` / `--help` | Version / help. |
24
+
25
+ ## Tool commands (15)
26
+
27
+ | Command | Maps to | Example |
28
+ |---------|---------|---------|
29
+ | `scrape <url>` | fetch_url / extract_content | `crawlforge scrape https://example.com --extract --format markdown` |
30
+ | `search <query>` | search_web | `crawlforge search "MCP tutorial" --limit 5` |
31
+ | `crawl <url>` | crawl_deep | `crawlforge crawl https://docs.example.com --depth 3 --max-pages 200` |
32
+ | `map <url>` | map_site | `crawlforge map https://example.com --format xml` |
33
+ | `extract <url>` | extract_structured / extract_with_llm | `crawlforge extract <url> --prompt "title, author" --model llama3.2` |
34
+ | `track <url>` | track_changes | `crawlforge track <url> --selector ".price" --threshold 1` |
35
+ | `analyze <url>` | analyze_content | `crawlforge analyze <url> --depth full` |
36
+ | `research <topic>` | deep_research | `crawlforge research "AI trends 2025" --depth deep --max-urls 30` |
37
+ | `stealth <url>` | stealth_mode | `crawlforge stealth <url> --engine camoufox --screenshot` |
38
+ | `batch <file>` | batch_scrape | `crawlforge batch urls.txt --format markdown --concurrency 10` |
39
+ | `actions <url>` | scrape_with_actions | `crawlforge actions <url> --script flow.json --screenshot` |
40
+ | `localize <url>` | localization | `crawlforge localize <url> --locale fr-FR --country FR` |
41
+ | `llmstxt <url>` | generate_llms_txt | `crawlforge llmstxt <url> --include-full` |
42
+ | `template <id> <target>` | scrape_template | `crawlforge template github-repo https://github.com/owner/repo` |
43
+ | `monitor <url>` | track_changes (scheduled) | `crawlforge monitor <url> --interval 60 --webhook <url>` |
44
+
45
+ ## Skill commands (2)
46
+
47
+ | Command | Purpose |
48
+ |---------|---------|
49
+ | `install-skills --target <claude-code\|cursor\|vscode\|all>` | Install skill files into your AI tool. |
50
+ | `uninstall-skills --target <...>` | Remove them. |
51
+
52
+ ## Output modes
53
+
54
+ ```bash
55
+ crawlforge search "nodejs MCP" # human-readable
56
+ crawlforge research "AI 2025" --pretty > report.json # pretty JSON to file
57
+ crawlforge search "nodejs" --json | jq '.results[].url' # compact JSON to jq
58
+ crawlforge scrape https://example.com --quiet && echo ok # exit code only
59
+ ```
60
+
61
+ ## Relevant env vars
62
+
63
+ | Variable | Purpose |
64
+ |----------|---------|
65
+ | `CRAWLFORGE_API_KEY` | API key (required for production). |
66
+ | `OLLAMA_BASE_URL` | Ollama server (default http://localhost:11434). |
67
+ | `OLLAMA_DEFAULT_MODEL` | Default model for `extract`. |
68
+ | `CRAWLFORGE_STEALTH_ENGINE` | Force `playwright` or `camoufox`. |
69
+ | `CRAWLFORGE_BROWSER_BACKEND` | `local` or `browserbase`. |
70
+
71
+ Exit code 0 = success, 1 = error. `crawlforge <command> --help` for per-command help.
@@ -0,0 +1,75 @@
1
+ # CrawlForge Credit Costs
2
+
3
+ Per-tool credit cost (source of truth: `AuthManager.getToolCost`). Every tool is
4
+ metered; there is no free tier. Tools marked "scales" cost more as work grows.
5
+
6
+ ## 1 credit
7
+
8
+ | Tool | Notes |
9
+ |------|-------|
10
+ | `fetch_url` | Raw HTTP fetch. |
11
+ | `extract_text` | Tag-stripped text/markdown. |
12
+ | `extract_links` | All links on a page. |
13
+ | `extract_metadata` | Title / meta / OG / schema.org. |
14
+ | `scrape_template` | Pre-built site extractor. |
15
+ | `list_ollama_models` | List local LLMs. |
16
+ | `get_batch_results` | Retrieve an already-paid batch. |
17
+
18
+ ## 2 credits
19
+
20
+ | Tool | Notes |
21
+ |------|-------|
22
+ | `scrape` | Unified multi-format single fetch. |
23
+ | `scrape_structured` | CSS-selector extraction. |
24
+ | `extract_content` | Readability-cleaned article. |
25
+ | `map_site` | URL discovery / sitemap. |
26
+ | `process_document` | PDF / DOCX / TXT parsing. |
27
+ | `localization` | Locale / geo emulation. |
28
+
29
+ ## 3 credits
30
+
31
+ | Tool | Notes |
32
+ |------|-------|
33
+ | `track_changes` | Baseline / compare / monitor. |
34
+ | `analyze_content` | Sentiment / entities / keywords. |
35
+ | `extract_structured` | Schema-driven (LLM + CSS fallback). |
36
+ | `extract_with_llm` | NL-prompt extraction. |
37
+
38
+ ## 4 credits
39
+
40
+ | Tool | Notes |
41
+ |------|-------|
42
+ | `summarize_content` | Extractive / abstractive summary. |
43
+ | `crawl_deep` | Site crawl; **scales** with page count. |
44
+
45
+ ## 5 credits
46
+
47
+ | Tool | Notes |
48
+ |------|-------|
49
+ | `stealth_mode` | Anti-bot browser (screenshots add a little). |
50
+ | `scrape_with_actions` | Browser automation then scrape. |
51
+ | `batch_scrape` | Many URLs; projection scales with URL count. |
52
+ | `search_web` | Web search. |
53
+ | `generate_llms_txt` | AI-compliance file. |
54
+
55
+ ## 8 credits
56
+
57
+ | Tool | Notes |
58
+ |------|-------|
59
+ | `agent` | Autonomous research; **scales** with `maxUrls` (and `model:"pro"`). |
60
+
61
+ ## 10+ credits
62
+
63
+ | Tool | Notes |
64
+ |------|-------|
65
+ | `deep_research` | Base 10; **scales** with `maxUrls`. **>50 URLs prompts for confirmation.** |
66
+
67
+ ## Cost-control tips
68
+
69
+ - Use the cheapest tool that satisfies the request (e.g. `extract_metadata` at 1
70
+ before `scrape` at 2).
71
+ - Prefer one `scrape` call (2) over several single-format basic calls.
72
+ - Cap dynamic tools: `deep_research`/`agent` via `maxUrls`, `crawl_deep` via
73
+ `max_pages`, `batch_scrape` via the URL list size.
74
+ - `get_batch_results` (1) is cheap — submit a batch once, page through results.
75
+ - Errors are charged at half the tool cost; creator mode is unlimited.
@@ -0,0 +1,106 @@
1
+ ---
2
+ name: crawlforge-stealth-browsing
3
+ description: "Bypasses bot detection and geo-restrictions with CrawlForge's stealth_mode and localization tools. Use when a site returns 403 or 429, CAPTCHAs, 'please enable JavaScript', or empty content, or is protected by Cloudflare, DataDome, or PerimeterX, or when the user needs region-specific pricing, geo-blocked content, or a specific locale, timezone, or currency. stealth_mode runs a stealth browser (playwright by default, camoufox for advanced fingerprinting) and can screenshot; localization emulates a country and language. Explains when to escalate from a normal scrape to stealth."
4
+ metadata:
5
+ version: 4.8.0
6
+ source: crawlforge-mcp-server
7
+ ---
8
+
9
+ # CrawlForge Stealth Browsing
10
+
11
+ Get past bot-detection systems and geo-blocks. Use `stealth_mode` when a normal
12
+ scrape is blocked, and `localization` when you need region-specific content,
13
+ pricing, or locale emulation.
14
+
15
+ ## When to escalate to stealth_mode
16
+
17
+ Escalate from a normal `scrape` / `fetch_url` (see crawlforge-web-scraping) when:
18
+
19
+ - The site returns **403 or 429** on a regular fetch.
20
+ - You get a **CAPTCHA** or a "please enable JavaScript" interstitial.
21
+ - Content comes back **empty** or only a shell (JS-rendered SPA).
22
+ - The site uses **Cloudflare, DataDome, PerimeterX**, or similar protection.
23
+
24
+ `stealth_mode` drives a real browser with randomized fingerprints, human
25
+ behavior simulation, and WebRTC/canvas/WebGL spoofing.
26
+
27
+ ## stealth_mode (cost: 5)
28
+
29
+ `stealth_mode` is operation-based. Typical flow: create a context, then create a
30
+ page that navigates to the target URL.
31
+
32
+ ```json
33
+ {
34
+ "tool": "stealth_mode",
35
+ "params": {
36
+ "operation": "create_context",
37
+ "stealthConfig": { "level": "advanced", "simulateHumanBehavior": true },
38
+ "engine": "playwright"
39
+ }
40
+ }
41
+ ```
42
+
43
+ Then use the returned `contextId`:
44
+
45
+ ```json
46
+ {
47
+ "tool": "stealth_mode",
48
+ "params": { "operation": "create_page", "contextId": "<id-from-create_context>", "urlToTest": "https://protected-site.com" }
49
+ }
50
+ ```
51
+
52
+ Operations: `configure`, `enable`, `disable`, `create_context`, `create_page`,
53
+ `get_stats`, `cleanup`. `stealthConfig.level` is `basic` / `medium` (default) /
54
+ `advanced`. Always run `cleanup` when done to release the browser.
55
+
56
+ ### Engine: playwright vs camoufox
57
+
58
+ - `engine:"playwright"` (default) — Chromium with stealth patches. Fast, good
59
+ for most basic bot detection.
60
+ - `engine:"camoufox"` — Firefox-based with native anti-detection (no patches).
61
+ Scores higher against DataDome / Cloudflare / PerimeterX and on CreepJS. Use
62
+ for heavily protected, financial, or e-commerce sites.
63
+
64
+ Full decision table: [engine selection](references/engine-selection.md).
65
+
66
+ ### CLI
67
+
68
+ ```bash
69
+ crawlforge stealth https://protected-site.com
70
+ crawlforge stealth https://protected-site.com --engine camoufox --wait 3000 --screenshot
71
+ ```
72
+
73
+ The CLI exposes a one-shot form (`--engine`, `--wait <ms>`, `--screenshot`).
74
+ Force the engine globally with `export CRAWLFORGE_STEALTH_ENGINE=camoufox`.
75
+
76
+ ## localization (cost: 2)
77
+
78
+ Emulate a country/language/timezone/currency for region-specific content and
79
+ geo-blocked pages.
80
+
81
+ ```json
82
+ {
83
+ "tool": "localization",
84
+ "params": { "operation": "configure_country", "countryCode": "DE", "language": "de", "currency": "EUR" }
85
+ }
86
+ ```
87
+
88
+ Operations: `configure_country`, `localize_search`, `localize_browser`,
89
+ `generate_timezone_spoof`, `handle_geo_blocking`, `auto_detect`, `get_stats`,
90
+ `get_supported_countries`. `countryCode` is ISO 3166-1 alpha-2; `currency` is
91
+ ISO 4217. Supports proxy routing and GPS geolocation emulation.
92
+
93
+ CLI: `crawlforge localize https://shop.example.com --locale en-GB --country GB --currency GBP`.
94
+
95
+ ## stealth_mode vs localization
96
+
97
+ - **Blocked / bot-detected** → `stealth_mode`.
98
+ - **Wrong region / language / currency, but not blocked** → `localization`.
99
+ - **Geo-blocked AND bot-protected** → `localization` to set region context, then
100
+ `stealth_mode` for the actual fetch.
101
+
102
+ ## Cost note
103
+
104
+ `stealth_mode` = 5 credits per call (screenshots add a small extra cost);
105
+ `localization` = 2 credits. Try the cheaper `scrape` (2 credits) first and only
106
+ escalate to stealth when you see the block signals above.