crawlforge-mcp-server 4.6.5 → 4.7.0

package/README.md CHANGED
@@ -67,9 +67,9 @@
67
67
  npm install -g crawlforge-mcp-server
68
68
  ```
69
69
 
70
- ### 2. Setup Your API Key (optional for the free local tools)
70
+ ### 2. Setup Your API Key (required)
71
71
 
72
- The 15 free local tools work immediately with **no API key at all**; skip straight to step 3 if that's all you need. To unlock the metered premium tools (`search_web`, `crawl_deep`, `stealth_mode`, `agent`, …):
72
+ Every tool requires a CrawlForge API key — new accounts get 1,000 free trial credits to start:
73
73
 
74
74
  ```bash
75
75
  npx crawlforge-setup
@@ -150,43 +150,38 @@ Restart Cursor to activate.
150
150
 
151
151
  ## 📊 Available Tools
152
152
 
153
- CrawlForge is **open-core**: 15 tools run locally on your machine and are **completely free, with no API key required**. The metered premium tools cover real infrastructure (search fees, proxies, browser farms) and need an API key.
154
-
155
- **Free Local Tools** (0 credits, no API key needed)
156
-
157
- | Tool | What it does |
158
- |------|--------------|
159
- | `fetch_url` | Fetch content from any URL |
160
- | `extract_text` | Extract clean text from web pages |
161
- | `extract_links` | Get all links from a page |
162
- | `extract_metadata` | Extract page metadata (title, OG tags, schema.org) |
163
- | `scrape` | **Unified single-fetch, multi-format extraction.** Pass a `formats` array (markdown/html/rawHtml/text/links/metadata/screenshot/json-schema) plus `onlyMainContent`; one fetch serves every requested format with per-format partial-success warnings. *The `screenshot` format is the one metered exception (2 credits — needs a server browser)* |
164
- | `scrape_structured` | Extract structured data with CSS selectors |
165
- | `scrape_template` | Structured data from well-known sites (Amazon, GitHub, LinkedIn, YouTube, Reddit, Hacker News, npm, and more) without writing selectors |
166
- | `extract_content` | Enhanced content extraction |
167
- | `summarize_content` | Generate intelligent summaries |
168
- | `analyze_content` | Comprehensive content analysis |
169
- | `extract_structured` | LLM-powered schema-driven extraction (your own LLM key or local Ollama) |
170
- | `extract_with_llm` | Natural-language extraction. **Defaults to a local Ollama model — no API key, no API costs.** Pass `provider: "openai" \| "anthropic"` with the matching key for cloud models |
171
- | `process_document` | Multi-format document processing |
172
- | `list_ollama_models` | List the Ollama models installed locally (helps you pick a `model` for `extract_with_llm`) |
173
- | `get_batch_results` | Retrieve paginated results for a `batch_scrape` job by `batchId` |
174
-
175
- **Metered Premium Tools** (3–10 credits, API key required)
153
+ CrawlForge requires a CrawlForge API key: **every tool is metered and consumes credits**. New accounts get **1,000 free trial credits** to start. Get a key at [crawlforge.dev/signup](https://www.crawlforge.dev/signup).
154
+
155
+ **All Tools** (API key required)
176
156
 
177
157
  | Tool | Credits | What it does |
178
158
  |------|---------|--------------|
179
- | `map_site` | 3 | Discover and map website structure (optional `search=` ranks the discovered URLs) |
159
+ | `fetch_url` | 1 | Fetch content from any URL |
160
+ | `extract_text` | 1 | Extract clean text from web pages |
161
+ | `extract_links` | 1 | Get all links from a page |
162
+ | `extract_metadata` | 1 | Extract page metadata (title, OG tags, schema.org) |
163
+ | `scrape_template` | 1 | Structured data from well-known sites (Amazon, GitHub, LinkedIn, YouTube, Reddit, Hacker News, npm, and more) without writing selectors |
164
+ | `list_ollama_models` | 1 | List the Ollama models installed locally (helps you pick a `model` for `extract_with_llm`) |
165
+ | `get_batch_results` | 1 | Retrieve paginated results for a `batch_scrape` job by `batchId` |
166
+ | `scrape` | 2 | **Unified single-fetch, multi-format extraction.** Pass a `formats` array (markdown/html/rawHtml/text/links/metadata/screenshot/json-schema) plus `onlyMainContent`; one fetch serves every requested format with per-format partial-success warnings |
167
+ | `scrape_structured` | 2 | Extract structured data with CSS selectors |
168
+ | `extract_content` | 2 | Enhanced content extraction |
169
+ | `map_site` | 2 | Discover and map website structure (optional `search=` ranks the discovered URLs) |
170
+ | `process_document` | 2 | Multi-format document processing |
171
+ | `localization` | 2 | Multi-language and geo-location management |
180
172
  | `track_changes` | 3 | Monitor content changes over time |
173
+ | `analyze_content` | 3 | Comprehensive content analysis |
174
+ | `extract_structured` | 3 | LLM-powered schema-driven extraction (your own LLM key or local Ollama) |
175
+ | `extract_with_llm` | 3 | Natural-language extraction. Defaults to a local Ollama model; pass `provider: "openai" \| "anthropic"` with the matching key for cloud models (external LLM billed by your provider) |
176
+ | `summarize_content` | 4 | Generate intelligent summaries |
177
+ | `crawl_deep` | 4 | Deep crawl entire websites |
181
178
  | `search_web` | 5 | Search the web using Google Search API |
182
- | `crawl_deep` | 5 | Deep crawl entire websites |
183
179
  | `batch_scrape` | 5 | Process multiple URLs simultaneously |
184
180
  | `scrape_with_actions` | 5 | Browser automation chains |
185
181
  | `generate_llms_txt` | 5 | Generate AI interaction guidelines |
186
- | `localization` | 5 | Multi-language and geo-location management |
182
+ | `stealth_mode` | 5 | Anti-detection browser management |
187
183
  | `agent` | 8 | **Autonomous research/extraction from a natural-language prompt — no URLs required.** Plans, gathers, and shapes an answer under hard safety stops (max steps/URLs/wall-clock enforced by the orchestrator, never the LLM) |
188
184
  | `deep_research` | 10 | Multi-stage research with source verification |
189
- | `stealth_mode` | 10 | Anti-detection browser management |
190
185
 
191
186
  For the full canonical capabilities reference (all tools, CLI commands, stealth engines, research workflow), see [SKILL.md](SKILL.md).
192
187
 
@@ -194,7 +189,7 @@ For the full canonical capabilities reference (all tools, CLI commands, stealth
194
189
 
195
190
  ## 💳 Pricing
196
191
 
197
- **15 local tools are free forever: no API key, no credit card.** Credits only meter the premium tools that run on CrawlForge infrastructure.
192
+ **Every tool is metered and requires an API key.** New accounts get 1,000 free trial credits (no credit card required) to start.
198
193
 
199
194
  | Plan | Credits/Month | Best For |
200
195
  |------|---------------|----------|
@@ -229,11 +224,16 @@ export OLLAMA_DEFAULT_MODEL="llama3.2" # default; any locally-pulled
229
224
  # Optional: Cloud LLM keys — only needed when you pass provider: "openai" or "anthropic"
230
225
  export OPENAI_API_KEY="sk-..."
231
226
  export ANTHROPIC_API_KEY="sk-ant-..."
227
+
228
+ # Optional: deep_research stealth extraction fallback (v4.6.6) — see below
229
+ export RESEARCH_STEALTH_ENGINE="auto" # auto (default) | camoufox | chromium
230
+ export RESEARCH_STEALTH_FALLBACK="true" # set to "false" to disable entirely
231
+ export RESEARCH_MAX_STEALTH_RETRIES="8" # cap on stealth retries per research run
232
232
  ```
233
233
 
234
234
  ### Local-LLM quickstart (`extract_with_llm` with Ollama)
235
235
 
236
- `extract_with_llm` defaults to a local Ollama model — no API key, no API costs, no data leaving your machine.
236
+ `extract_with_llm` defaults to a local Ollama model — no LLM-provider key, no per-token LLM costs, and no data leaving your machine (the CrawlForge credit cost still applies).
237
237
 
238
238
  ```bash
239
239
  # 1. Install Ollama: https://ollama.com
@@ -247,6 +247,31 @@ ollama pull llama3.2
247
247
  # extract_with_llm({ url: "https://example.com", prompt: "…", model: "llama3.2" })
248
248
  ```
249
249
 
250
+ ### Stealth extraction for `deep_research` (Camoufox)
251
+
252
+ `deep_research` automatically retries sources that block the normal fetch path (Reddit, Quora, forums, and Cloudflare/DataDome-protected pages return HTTP 403) through a **real fingerprinted browser**, then re-extracts from the rendered HTML. It's bounded (`RESEARCH_MAX_STEALTH_RETRIES`, default 8, plus a per-page timeout) and lazy — the browser stack only loads when a source is actually blocked.
253
+
254
+ Engine selection (`RESEARCH_STEALTH_ENGINE`):
255
+
256
+ - **`auto`** (default) — prefer **Camoufox** (Firefox anti-detect), fall back to Chromium stealth, then plain fetch.
257
+ - **`camoufox`** — force Camoufox.
258
+ - **`chromium`** — force the Chromium stealth engine.
259
+
260
+ Headless Chromium **cannot** clear modern challenges (Cloudflare Turnstile, DataDome) — **Camoufox can**. In testing it recovered Quora and Trustpilot pages that were otherwise fully blocked. To enable it, install the optional dependency and run its one-time binary fetch:
261
+
262
+ ```bash
263
+ # Camoufox is declared as an optional dependency, so a normal install already pulls it.
264
+ # If you installed with --no-optional, add it explicitly:
265
+ npm install camoufox
266
+
267
+ # One-time download of the Camoufox Firefox binary (~130 MB):
268
+ npx camoufox fetch
269
+ ```
270
+
271
+ Without the Camoufox binary, `deep_research` silently falls back to Chromium stealth and then to plain fetch — no errors, just lower recovery on heavily-protected sites. Disable the whole fallback with `RESEARCH_STEALTH_FALLBACK=false`.
272
+
273
+ > **Note:** Hard IP-reputation blocks (e.g. Reddit's edge `403`) resist headless stealth from any IP and require residential/mobile proxies, which CrawlForge does not provide. See [docs/stealth-engines.md](docs/stealth-engines.md) for details.
274
+
250
275
  ### Manual Configuration
251
276
 
252
277
  Your configuration is stored at `~/.crawlforge/config.json`:
@@ -287,7 +312,7 @@ Once configured, use these tools in your AI assistant:
287
312
  - **Action allowlist**: `scrape_with_actions` accepts only 7 action types (`wait`, `click`, `type`, `press`, `scroll`, `screenshot`, `executeJavaScript`). No download, file-write, or arbitrary cross-page navigation primitives exist.
288
313
  - **JavaScript gate**: The `executeJavaScript` action throws by default. Set `ALLOW_JAVASCRIPT_EXECUTION=true` at deploy time to enable (not recommended in production).
289
314
  - **MCP Elicitation** (v3.6.0): Four tools request user confirmation before executing expensive operations — `deep_research` (>50 URLs), `batch_scrape` (sync mode, >25 URLs), `crawl_deep` (projected >500 pages), `extract_structured` (schema has >3 required fields with no LLM configured). Credit-low situations also elicit. Confirmation is best-effort: if the MCP client does not support elicitation the tool proceeds (fail-open).
290
- - **Per-tool credit gating**: Every tool is wrapped with `withAuth()`; metered tools check and deduct credits before execution (fail-closed since v3.0.18). Free local tools (cost 0) skip the credit path entirely.
315
+ - **Per-tool credit gating**: Every tool is wrapped with `withAuth()` and is metered: credits are checked and deducted before execution, and a valid API key is required for every tool (fail-closed since v3.0.18).
291
316
 
292
317
  See [docs/sandboxing-and-approvals.md](docs/sandboxing-and-approvals.md) for the full reference.
293
318
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "crawlforge-mcp-server",
3
- "version": "4.6.5",
3
+ "version": "4.7.0",
4
4
  "mcpName": "io.github.mysleekdesigns/crawlforge-mcp-server",
5
5
  "description": "CrawlForge MCP Server - Professional Model Context Protocol server with 26 web scraping, crawling, deep-research, and autonomous-extraction tools. Returns clean Markdown and structured JSON for Claude, Cursor, and any MCP client. Defaults to local Ollama for LLM extraction (no API key needed); OpenAI/Anthropic available as opt-in. Includes a unified multi-format scrape tool, an autonomous agent, pre-built site templates, and Camoufox stealth browsing.",
6
6
  "main": "server.js",
@@ -131,6 +131,9 @@
131
131
  "winston": "^3.11.0",
132
132
  "zod": "^3.23.8"
133
133
  },
134
+ "optionalDependencies": {
135
+ "camoufox": "^0.1.19"
136
+ },
134
137
  "devDependencies": {
135
138
  "@jest/globals": "^30.3.0",
136
139
  "c8": "^11.0.0",
package/server.js CHANGED
@@ -68,15 +68,14 @@ if (!AuthManager.isAuthenticated() && !AuthManager.isCreatorMode()) {
68
68
  process.exit(1);
69
69
  }
70
70
  } else {
71
- // Open-core Phase 2: no API key is fine; start in free-tier mode.
72
- // Tier-0 tools (cost 0) run locally without a key; Tier-1 metered tools
73
- // return a "not configured" error until a key is set.
71
+ // Every tool is metered and requires an API key; there is no free tier.
72
+ // The server still starts so the MCP client can list tools, but every
73
+ // tool call errors with "not configured" until a key is set.
74
74
  // Status → stderr; stdout is reserved for the MCP JSON-RPC stream.
75
- console.error('ℹ️ CrawlForge running in free-tier mode (no API key configured).');
76
- console.error(' Free local tools work out of the box. Premium tools (search_web,');
77
- console.error(' crawl_deep, stealth_mode, agent, deep_research, …) need an API key:');
78
- console.error(' get one at https://www.crawlforge.dev/signup, then run `npm run setup`');
79
- console.error(' or set CRAWLFORGE_API_KEY.');
75
+ console.error('⚠️ No CrawlForge API key configured: all tools require a key.');
76
+ console.error(' Every tool (fetch_url, search_web, deep_research, …) is metered.');
77
+ console.error(' Get a key at https://www.crawlforge.dev/signup, then run `npm run setup`');
78
+ console.error(' or set CRAWLFORGE_API_KEY. Tool calls will error until a key is set.');
80
79
  }
81
80
  }
82
81
 
@@ -90,7 +89,7 @@ if (configErrors.length > 0 && config.server.nodeEnv === 'production') {
90
89
  // Create the server
91
90
  const server = new McpServer({
92
91
  name: "crawlforge",
93
- version: "4.6.5",
92
+ version: "4.7.0",
94
93
  description: "Production-ready MCP server with 26 web scraping, crawling, and content processing tools. Features MCP Resources (crawlforge://), Prompts, Sampling fallback, Elicitation, stealth browsing, deep research, structured extraction, change tracking, local-LLM extraction via Ollama, unified multi-format scrape, and autonomous agent tool.",
95
94
  homepage: "https://www.crawlforge.dev",
96
95
  icon: "https://www.crawlforge.dev/icon.png"
@@ -239,11 +239,6 @@ class AuthManager {
239
239
  return true;
240
240
  }
241
241
 
242
- // Open-core Phase 2: Tier-0 tools cost 0 and run without an API key
243
- if (estimatedCredits === 0) {
244
- return true;
245
- }
246
-
247
242
  if (!this.config) {
248
243
  throw new Error('CrawlForge not configured. Run setup first.');
249
244
  }
@@ -507,51 +502,53 @@ class AuthManager {
507
502
  /**
508
503
  * Get credit cost for a tool.
509
504
  *
510
- * Open-core Phase 1 (docs/tier-map.md): this table is the single source of
511
- * truth shared with the backend (crawlforge-website/src/lib/credits.ts).
512
- * Tier 0 tools run locally on the user's machine and cost 0; Tier 1 tools
513
- * are metered per COGS.
505
+ * Every tool is metered and requires an API key — there is no free tier.
506
+ * This table is the single source of truth shared with the backend
507
+ * (crawlforge-website/src/lib/credits.ts TOOL_CREDIT_COSTS).
514
508
  *
515
509
  * @param {string} tool
516
- * @param {object} [params] — invocation params; only used for per-call
517
- * exceptions (scrape's screenshot format needs a server browser).
518
510
  */
519
- getToolCost(tool, params) {
520
- // Tier-0 exception: the screenshot format of `scrape` is browser-backed
521
- if (tool === 'scrape' && Array.isArray(params?.formats) && params.formats.includes('screenshot')) {
522
- return 2;
523
- }
524
-
511
+ getToolCost(tool) {
525
512
  const costs = {
526
- // Tier 0 — free, local (key optional)
527
- fetch_url: 0,
528
- extract_text: 0,
529
- extract_links: 0,
530
- extract_metadata: 0,
531
- scrape_structured: 0,
532
- scrape_template: 0,
533
- extract_content: 0,
534
- scrape: 0, // 2 if formats includes 'screenshot' (handled above)
535
- summarize_content: 0,
536
- analyze_content: 0,
537
- extract_with_llm: 0,
538
- extract_structured: 0,
539
- process_document: 0,
540
- list_ollama_models: 0,
541
- get_batch_results: 0, // retrieval of an already-paid batch job
542
-
543
- // Tier 1 — metered (costs reflect COGS)
544
- map_site: 3,
513
+ // 1 credit
514
+ fetch_url: 1,
515
+ extract_text: 1,
516
+ extract_links: 1,
517
+ extract_metadata: 1,
518
+ scrape_template: 1,
519
+ list_ollama_models: 1,
520
+ get_batch_results: 1, // retrieval of an already-paid batch job
521
+
522
+ // 2 credits
523
+ scrape_structured: 2,
524
+ extract_content: 2,
525
+ map_site: 2,
526
+ process_document: 2,
527
+ localization: 2,
528
+ scrape: 2,
529
+
530
+ // 3 credits
545
531
  track_changes: 3,
546
- generate_llms_txt: 5,
547
- search_web: 5,
548
- crawl_deep: 5,
549
- batch_scrape: 5,
532
+ analyze_content: 3,
533
+ extract_structured: 3,
534
+ extract_with_llm: 3,
535
+
536
+ // 4 credits
537
+ summarize_content: 4,
538
+ crawl_deep: 4,
539
+
540
+ // 5 credits
541
+ stealth_mode: 5,
550
542
  scrape_with_actions: 5,
551
- localization: 5,
543
+ batch_scrape: 5,
544
+ search_web: 5,
545
+ generate_llms_txt: 5,
546
+
547
+ // 8 credits
552
548
  agent: 8, // projectCost() scales with maxUrls
553
- deep_research: 10,
554
- stealth_mode: 10
549
+
550
+ // 10 credits
551
+ deep_research: 10
555
552
  };
556
553
 
557
554
  return costs[tool] ?? 1;
@@ -574,7 +571,7 @@ class AuthManager {
574
571
 
575
572
  // Override for tools whose cost scales with params
576
573
  let projected = base;
577
- let note = base === 0 ? 'Free local tool — no credits charged.' : 'Fixed cost per invocation.';
574
+ let note = 'Fixed cost per invocation.';
578
575
 
579
576
  switch (toolName) {
580
577
  case 'batch_scrape': {
@@ -596,14 +593,11 @@ class AuthManager {
596
593
  break;
597
594
  }
598
595
  case 'extract_with_llm':
599
- note = 'Free local tool. External LLM API call billed by your LLM provider, not in credits.';
596
+ note = 'External LLM API call billed by your LLM provider, separate from the credit cost.';
600
597
  break;
601
598
  case 'scrape': {
602
- // Free local tool; only the browser-backed screenshot format is metered
603
599
  projected = base;
604
- note = base > 0
605
- ? 'screenshot format requires a server browser (2 credits). Other formats are free.'
606
- : 'Free local tool — no credits charged. json format may incur external LLM cost.';
600
+ note = 'Fixed cost per invocation. json format may incur external LLM cost (billed by your provider).';
607
601
  break;
608
602
  }
609
603
  case 'agent': {
@@ -614,7 +608,7 @@ class AuthManager {
614
608
  break;
615
609
  }
616
610
  default:
617
- note = base === 0 ? 'Free local tool — no credits charged.' : 'Fixed cost per invocation.';
611
+ note = 'Fixed cost per invocation.';
618
612
  }
619
613
 
620
614
  return { projected, note };
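
The re-tiered cost table in the hunk above can be read as a plain lookup with a 1-credit fallback. The following is a minimal sketch, not the package's actual module: `TOOL_COSTS` is an illustrative copy of the 4.7.0 values, and `callsWithinBudget` is a hypothetical helper that ignores the per-call scaling `projectCost()` applies to tools such as `agent` and `batch_scrape`.

```javascript
// Illustrative copy of the 4.7.0 cost table shown in the diff (26 tools).
const TOOL_COSTS = {
  // 1 credit
  fetch_url: 1, extract_text: 1, extract_links: 1, extract_metadata: 1,
  scrape_template: 1, list_ollama_models: 1, get_batch_results: 1,
  // 2 credits
  scrape_structured: 2, extract_content: 2, map_site: 2,
  process_document: 2, localization: 2, scrape: 2,
  // 3 credits
  track_changes: 3, analyze_content: 3, extract_structured: 3, extract_with_llm: 3,
  // 4 credits
  summarize_content: 4, crawl_deep: 4,
  // 5 credits
  stealth_mode: 5, scrape_with_actions: 5, batch_scrape: 5,
  search_web: 5, generate_llms_txt: 5,
  // 8 and 10 credits
  agent: 8,
  deep_research: 10,
};

// Mirrors `costs[tool] ?? 1`: a tool missing from the table costs 1 credit.
function getToolCost(tool) {
  return TOOL_COSTS[tool] ?? 1;
}

// Hypothetical helper: how many fixed-cost calls a credit balance covers.
function callsWithinBudget(tool, credits) {
  return Math.floor(credits / getToolCost(tool));
}

console.log(callsWithinBudget('deep_research', 1000));
console.log(callsWithinBudget('fetch_url', 1000));
```

Under these assumptions, the 1,000 trial credits mentioned in the README cover on the order of a hundred `deep_research` runs or a thousand `fetch_url` calls, before any failed-call charges.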
@@ -35,6 +35,19 @@ export class ResearchOrchestrator extends EventEmitter {
35
35
  cacheEnabled = true,
36
36
  cacheTTL = 1800000, // 30 minutes
37
37
  researchApproach = 'broad',
38
+ // Stealth-browser fallback for sources that block the plain fetch/extract
39
+ // path (Reddit, Quora, forums → HTTP 403). On by default; bounded so it
40
+ // cannot blow the research time budget. Disable with
41
+ // RESEARCH_STEALTH_FALLBACK=false.
42
+ enableStealthFallback = process.env.RESEARCH_STEALTH_FALLBACK !== 'false',
43
+ maxStealthRetries = parseInt(process.env.RESEARCH_MAX_STEALTH_RETRIES || '8', 10),
44
+ // 'auto' (default) prefers Camoufox (Firefox anti-detect — beats
45
+ // Cloudflare/DataDome that headless Chromium can't) and falls back to
46
+ // Chromium stealth when Camoufox/its binary is unavailable. Force one
47
+ // with RESEARCH_STEALTH_ENGINE=camoufox|chromium.
48
+ stealthEngine = process.env.RESEARCH_STEALTH_ENGINE || 'auto',
49
+ stealthLevel = 'medium',
50
+ stealthTimeoutMs = 20000,
38
51
  searchConfig = {},
39
52
  crawlConfig = {},
40
53
  extractConfig = {},
@@ -49,6 +62,18 @@ export class ResearchOrchestrator extends EventEmitter {
49
62
  this.enableSourceVerification = enableSourceVerification;
50
63
  this.enableConflictDetection = enableConflictDetection;
51
64
 
65
+ // Stealth fallback config + lazy state (browser launched only on first block)
66
+ this.enableStealthFallback = enableStealthFallback;
67
+ this.maxStealthRetries = Math.max(0, maxStealthRetries);
68
+ this.stealthEngine = stealthEngine;
69
+ this.stealthLevel = stealthLevel;
70
+ this.stealthTimeoutMs = stealthTimeoutMs;
71
+ this._stealthManager = null; // Chromium StealthBrowserManager (fallback engine)
72
+ this._stealthBrowser = null; // Camoufox browser handle (preferred engine)
73
+ this._stealthEngineActive = null;
74
+ this._stealthInit = null;
75
+ this._stealthCount = 0;
76
+
52
77
  // Initialize tools
53
78
  this.searchTool = new SearchWebTool(searchConfig);
54
79
  this.crawlTool = new CrawlDeepTool(crawlConfig);
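
How the three `RESEARCH_STEALTH_*` variables resolve to constructor defaults can be sketched in isolation. `readStealthConfig` is a hypothetical helper, not a function in the package; it only reuses the expressions visible in the two hunks above (`!== 'false'`, `parseInt(... || '8', 10)`, `Math.max(0, ...)`, and `|| 'auto'`).

```javascript
// Hypothetical helper mirroring the default expressions from the constructor.
function readStealthConfig(env) {
  const parsed = parseInt(env.RESEARCH_MAX_STEALTH_RETRIES || '8', 10);
  return {
    // Only the exact string "false" disables the fallback.
    enableStealthFallback: env.RESEARCH_STEALTH_FALLBACK !== 'false',
    // Negative values clamp to 0; a non-numeric value stays NaN.
    maxStealthRetries: Math.max(0, parsed),
    stealthEngine: env.RESEARCH_STEALTH_ENGINE || 'auto',
  };
}

console.log(readStealthConfig({}));
console.log(readStealthConfig({ RESEARCH_STEALTH_FALLBACK: 'false' }));
console.log(readStealthConfig({ RESEARCH_MAX_STEALTH_RETRIES: '-3' }));
```

One consequence of these expressions, at least in this sketch: a non-numeric `RESEARCH_MAX_STEALTH_RETRIES` parses to `NaN`, and because any `count < NaN` comparison is false, that value would behave like a budget of zero rather than falling back to 8.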
@@ -101,7 +126,9 @@ export class ResearchOrchestrator extends EventEmitter {
101
126
  llmAnalysisCalls: 0,
102
127
  semanticAnalysisTime: 0,
103
128
  queryExpansionTime: 0,
104
- synthesisTime: 0
129
+ synthesisTime: 0,
130
+ stealthRetries: 0,
131
+ stealthRecovered: 0
105
132
  };
106
133
  }
107
134
 
@@ -203,6 +230,9 @@ export class ResearchOrchestrator extends EventEmitter {
203
230
  Object.keys(this.metrics).forEach(key => {
204
231
  this.metrics[key] = 0;
205
232
  });
233
+
234
+ // Reset per-run stealth-retry budget
235
+ this._stealthCount = 0;
206
236
  }
207
237
 
208
238
  /**
@@ -551,11 +581,38 @@ export class ResearchOrchestrator extends EventEmitter {
551
581
  }
552
582
 
553
583
  // Normalize content to string (extract_content returns {text: "..."}, fallback returns string)
554
- const contentText = contentData && contentData.content
555
- ? (typeof contentData.content === 'string'
556
- ? contentData.content
557
- : (contentData.content.text || ''))
584
+ const normalizeContent = (cd) => cd && cd.content
585
+ ? (typeof cd.content === 'string' ? cd.content : (cd.content.text || ''))
558
586
  : '';
587
+ let contentText = normalizeContent(contentData);
588
+
589
+ // Stealth fallback: high-value discussion sources (Reddit, Quora,
590
+ // forums) return HTTP 403 to the plain fetch/extract path. When the
591
+ // normal path produced no usable content, retry through a real
592
+ // fingerprinted browser and re-run extraction on the rendered HTML.
593
+ // Bounded by maxStealthRetries + a per-page timeout.
594
+ const blocked = !contentData || contentData.success === false || contentText.trim().length === 0;
595
+ if (blocked && this.enableStealthFallback && this._stealthCount < this.maxStealthRetries) {
596
+ this._stealthCount++;
597
+ this.metrics.stealthRetries++;
598
+ try {
599
+ const stealthHtml = await this._stealthFetchHtml(source.link);
600
+ if (stealthHtml) {
601
+ contentData = await this.extractTool.execute({
602
+ url: source.link,
603
+ html: stealthHtml,
604
+ options: { includeMetadata: true, includeStructuredData: true }
605
+ });
606
+ contentText = normalizeContent(contentData);
607
+ if (contentData && contentData.success !== false && contentText.trim().length > 0) {
608
+ this.metrics.stealthRecovered++;
609
+ this.logActivity('stealth_recovery', { url: source.link });
610
+ }
611
+ }
612
+ } catch (stealthError) {
613
+ this.logger.warn('Stealth fallback failed', { url: source.link, error: stealthError.message });
614
+ }
615
+ }
559
616
 
560
617
  // Only count and enhance sources that actually produced non-empty content.
561
618
  // Skip failed extractions and empty {text:""} results.
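
The gating logic in the hunk above (normalize the extraction result, decide whether the source counts as blocked, then spend one unit of a per-run retry budget) can be exercised without a browser. In this sketch `normalizeContent` is copied from the diff, while `isBlocked` and `makeRetryBudget` are hypothetical stand-ins for the inline `blocked` expression and the `_stealthCount` / `maxStealthRetries` pair.

```javascript
// Copied from the diff: extract_content returns {text: "..."}, the fallback a string.
const normalizeContent = (cd) => cd && cd.content
  ? (typeof cd.content === 'string' ? cd.content : (cd.content.text || ''))
  : '';

// Hypothetical wrapper around the inline `blocked` expression.
function isBlocked(contentData) {
  const text = normalizeContent(contentData);
  return !contentData || contentData.success === false || text.trim().length === 0;
}

// Hypothetical stand-in for the _stealthCount < maxStealthRetries check.
function makeRetryBudget(maxRetries) {
  let used = 0;
  return {
    tryConsume() {
      if (used >= maxRetries) return false;
      used++;
      return true;
    },
    get used() { return used; },
  };
}

const budget = makeRetryBudget(2);
const results = [null, { content: { text: '' } }, { success: false }, { content: 'ok' }];
for (const r of results) {
  const retry = isBlocked(r) && budget.tryConsume();
  console.log(isBlocked(r), retry);
}
```

Note that the budget is consumed before the stealth attempt runs, so a failed or thrown attempt still counts against the cap; that matches the order of `this._stealthCount++` and the `try` block in the diff.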
@@ -641,10 +698,134 @@ export class ResearchOrchestrator extends EventEmitter {
641
698
  }
642
699
  });
643
700
 
701
+ // Tear down the stealth browser as soon as the extraction stage is done —
702
+ // it is only needed here and would otherwise leak a Playwright handle.
703
+ await this._closeStealth();
704
+
644
705
  // Sort by relevance score (LLM or traditional)
645
706
  return detailedFindings.sort((a, b) => (b.relevanceScore || 0) - (a.relevanceScore || 0));
646
707
  }
647
708
 
709
+ /**
710
+ * Lazily launch the stealth browser once. The heavy browser stack is only
711
+ * loaded when a source actually blocks the plain path. Engine selection:
712
+ * - 'camoufox'/'auto' → Camoufox (Firefox anti-detect). Loaded via the CJS
713
+ * build (its ESM bundle has a broken dynamic-require). Beats Cloudflare/
714
+ * DataDome challenges that patched headless Chromium can't pass.
715
+ * - 'chromium', or any Camoufox failure under 'auto' → StealthBrowserManager.
716
+ */
717
+ async _getStealthBrowser() {
718
+ if (!this._stealthInit) {
719
+ this._stealthInit = (async () => {
720
+ if (this.stealthEngine === 'camoufox' || this.stealthEngine === 'auto') {
721
+ try {
722
+ const { createRequire } = await import('module');
723
+ const require = createRequire(import.meta.url);
724
+ const camoufox = require('camoufox'); // CJS build — ESM build is broken
725
+ await this._ensureCamoufoxLayout(camoufox);
726
+ this._stealthBrowser = await camoufox.Camoufox({ headless: true });
727
+ this._stealthEngineActive = 'camoufox';
728
+ this.logger.info('Stealth fallback using Camoufox (Firefox) engine');
729
+ return;
730
+ } catch (e) {
731
+ if (this.stealthEngine === 'camoufox') throw e; // explicit request → surface
732
+ this.logger.warn('Camoufox unavailable, falling back to Chromium stealth', { error: e.message });
733
+ }
734
+ }
735
+ const { StealthBrowserManager } = await import('./StealthBrowserManager.js');
736
+ this._stealthManager = new StealthBrowserManager();
737
+ await this._stealthManager.launchStealthBrowser({ level: this.stealthLevel });
738
+ this._stealthEngineActive = 'chromium';
739
+ })();
740
+ }
741
+ await this._stealthInit;
742
+ }
743
+
744
+ /**
745
+ * macOS packaging fix for camoufox-js: it expects properties.json in
746
+ * Camoufox.app/Contents/MacOS/, but the .app bundle ships it under
747
+ * Contents/Resources/. Bridge it so the launcher can boot. Best-effort.
748
+ */
749
+ async _ensureCamoufoxLayout(camoufox) {
750
+ if (process.platform !== 'darwin' || !camoufox?.INSTALL_DIR) return;
751
+ try {
752
+ const fs = await import('fs');
753
+ const path = await import('path');
754
+ const appDir = path.join(camoufox.INSTALL_DIR, 'Camoufox.app', 'Contents');
755
+ const target = path.join(appDir, 'MacOS', 'properties.json');
756
+ const source = path.join(appDir, 'Resources', 'properties.json');
757
+ if (!fs.existsSync(target) && fs.existsSync(source)) {
758
+ fs.copyFileSync(source, target);
759
+ }
760
+ } catch { /* best-effort; launch surfaces a real error if it matters */ }
761
+ }
762
+
763
+ /**
764
+ * Fetch a URL's fully-rendered HTML through the stealth browser. Returns the
765
+ * HTML string, or null if every attempt was blocked / empty.
766
+ *
767
+ * Cloudflare/DataDome challenges are probabilistic — the same URL may serve a
768
+ * challenge on one load and the real page on the next — so Camoufox retries a
769
+ * few times with a fresh page each attempt. Chromium can't clear these at all
770
+ * (proven), so it gets a single attempt to avoid burning the time budget.
771
+ */
772
+ async _stealthFetchHtml(url) {
773
+ await this._getStealthBrowser();
774
+ const attempts = this._stealthEngineActive === 'camoufox' ? 3 : 1;
775
+ for (let i = 0; i < attempts; i++) {
776
+ const html = await this._stealthFetchOnce(url);
777
+ if (html) return html;
778
+ }
779
+ return null;
780
+ }
781
+
782
+ /** One stealth navigation. Fresh page/context; judges blocked by rendered content. */
783
+ async _stealthFetchOnce(url) {
784
+ let page;
785
+ if (this._stealthEngineActive === 'camoufox') {
786
+ page = await this._stealthBrowser.newPage();
787
+ } else {
788
+ const { contextId } = await this._stealthManager.createStealthContext({ level: this.stealthLevel });
789
+ page = await this._stealthManager.createStealthPage(contextId);
790
+ }
791
+ try {
792
+ const resp = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: this.stealthTimeoutMs });
793
+ // Do NOT bail on the initial HTTP status: anti-bot challenges (Cloudflare
794
+ // Turnstile) return 403 on the first response and only resolve to the
795
+ // real page after their JS runs. Let it settle, then judge by the
796
+ // *rendered* content instead.
797
+ await page.waitForLoadState('networkidle', { timeout: 8000 }).catch(() => {});
798
+ await page.waitForTimeout(2500).catch(() => {});
799
+ const html = await page.content();
800
+ const title = (await page.title().catch(() => '')) || '';
801
+ const bodyLen = await page.evaluate(() => document.body?.innerText?.trim().length || 0).catch(() => 0);
802
+
803
+ // Still a challenge/block page → treat as blocked.
804
+ const challengeTitle = /just a moment|checking your browser|attention required|verify you are human|access denied|^blocked$/i.test(title);
805
+ const status = resp ? resp.status() : 0;
806
+ if (challengeTitle) return null;
807
+ if (status >= 400 && bodyLen < 500) return null; // hard block (e.g. Reddit 403 shell)
808
+ if (bodyLen < 200) return null; // empty / interstitial
809
+ return html && html.length > 200 ? html : null;
810
+ } finally {
811
+ await page.close().catch(() => {});
812
+ }
813
+ }
814
+
815
+ /** Close the stealth browser and reset its lazy state (idempotent). */
816
+ async _closeStealth() {
817
+ try {
818
+ if (this._stealthBrowser) await this._stealthBrowser.close().catch(() => {});
819
+ if (this._stealthManager) await this._stealthManager.cleanup().catch(() => {});
820
+ } catch (e) {
821
+ this.logger.warn('Stealth browser cleanup failed', { error: e.message });
822
+ }
823
+ this._stealthBrowser = null;
824
+ this._stealthManager = null;
825
+ this._stealthEngineActive = null;
826
+ this._stealthInit = null;
827
+ }
828
+
648
829
  /**
649
830
  * Verify source credibility using multiple factors
650
831
  */
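
The post-render checks in `_stealthFetchOnce` judge a page by what was rendered rather than by the first HTTP status. They can be pulled out into a pure function for illustration. `judgeRender` is a hypothetical name; the regular expression and the 500 / 200 character thresholds are taken from the hunk above, and the input object stands in for values the real method reads from the Playwright page.

```javascript
// Title patterns treated as a challenge or block page (from the diff).
const CHALLENGE_TITLE = /just a moment|checking your browser|attention required|verify you are human|access denied|^blocked$/i;

// Hypothetical pure version of the checks: returns the HTML for a usable
// render, or null when the page still looks blocked or empty.
function judgeRender({ title, status, bodyLen, html }) {
  if (CHALLENGE_TITLE.test(title)) return null;      // still on a challenge page
  if (status >= 400 && bodyLen < 500) return null;   // error status with a thin body
  if (bodyLen < 200) return null;                    // empty or interstitial page
  return html && html.length > 200 ? html : null;
}

const page = 'x'.repeat(1000);
console.log(judgeRender({ title: 'Just a moment...', status: 200, bodyLen: 900, html: page }) === null);
console.log(judgeRender({ title: 'Thread', status: 403, bodyLen: 5000, html: page }) === page);
```

The second call shows the case the comment in the diff describes: a response that started as a 403 but rendered a substantial body is accepted, because the status alone is not treated as conclusive.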
@@ -1484,7 +1665,10 @@ export class ResearchOrchestrator extends EventEmitter {
1484
1665
  try {
1485
1666
  // Stop any active research
1486
1667
  this.stopResearch();
1487
-
1668
+
1669
+ // Tear down the stealth browser if one was launched
1670
+ await this._closeStealth();
1671
+
1488
1672
  // Clear cache if available
1489
1673
  if (this.cache && typeof this.cache.clear === "function") {
1490
1674
  await this.cache.clear();
@@ -1522,9 +1706,11 @@ export class ResearchOrchestrator extends EventEmitter {
1522
1706
  llmAnalysisCalls: 0,
1523
1707
  semanticAnalysisTime: 0,
1524
1708
  queryExpansionTime: 0,
1525
- synthesisTime: 0
1709
+ synthesisTime: 0,
1710
+ stealthRetries: 0,
1711
+ stealthRecovered: 0
1526
1712
  };
1527
-
1713
+
1528
1714
  } catch (error) {
1529
1715
  // Silent cleanup - do not throw errors during cleanup
1530
1716
  console.warn("Warning during ResearchOrchestrator cleanup:", error.message);
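
The `_stealthInit` field in the orchestrator diff follows a memoized-promise pattern: the first caller starts the launch, later callers await the same promise, and `_closeStealth()` clears it so the next run can launch again. A minimal sketch of that pattern follows; `LazyBrowser` and its `launch` argument are hypothetical, with no real browser involved.

```javascript
// Hypothetical lazy wrapper: `launch` is any async factory (a browser
// launcher in the real code, a stub here).
class LazyBrowser {
  constructor(launch) {
    this._launch = launch;
    this._init = null;   // memoized launch promise, like _stealthInit
    this.launches = 0;
  }

  // Every caller gets the same promise until reset() is called.
  get() {
    if (!this._init) {
      this.launches++;
      this._init = (async () => this._launch())();
    }
    return this._init;
  }

  // Mirrors the reset at the end of _closeStealth().
  reset() {
    this._init = null;
  }
}

const lazy = new LazyBrowser(async () => 'browser-handle');
const first = lazy.get();
const second = lazy.get();
console.log(first === second, lazy.launches);
first.then((handle) => console.log(handle));
```

One property of this sketch worth keeping in mind: a rejected launch promise stays cached until `reset()` runs, so callers in the same run would see the same failure rather than triggering a fresh launch.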
@@ -4,8 +4,8 @@
4
4
  * (OpenTelemetry spans + Prometheus counters) added in v3.2.0.
5
5
  *
6
6
  * Contract:
7
- * - resolves toolCost once per call (params-aware; 0-cost Tier-0 tools skip
8
- * the credit check and usage reports entirely; open-core Phase 2)
7
+ * - resolves toolCost once per call; every tool is metered (no free tier),
8
+ * so a valid API key is required for every invocation
9
9
  * - try/finally guarantees a single `tool invocation` log line per call
10
10
  * - log payload: { toolName, paramHash, durationMs, outcome, creditCost, creatorMode }
11
11
  * - outcome ∈ { 'success' | 'error' | 'insufficient_credits' }
@@ -36,16 +36,12 @@ export function makeWithAuth({ authManager, logger, metrics = null }) {
36
36
  const startTime = Date.now();
37
37
  const paramHash = hashParams(params);
38
38
  const creatorMode = authManager.isCreatorMode();
39
- // Params-aware: scrape's screenshot format is metered, other formats free
40
39
  const creditCost = creatorMode ? 0 : authManager.getToolCost(toolName, params);
41
- // Open-core Phase 2: Tier-0 tools (cost 0) run locally for free — no
42
- // credit check, no usage report, and no API key required.
43
- const freeTier = creditCost === 0;
44
40
  let outcome = 'pending';
45
41
  let thrown = null;
46
42
 
47
43
  try {
48
- if (!creatorMode && !freeTier) {
44
+ if (!creatorMode) {
49
45
  const hasCredits = await authManager.checkCredits(creditCost);
50
46
  if (!hasCredits) {
51
47
  outcome = 'insufficient_credits';
@@ -90,7 +86,7 @@ export function makeWithAuth({ authManager, logger, metrics = null }) {
90
86
  // Cost injection must never break the request path
91
87
  }
92
88
 
93
- if (!creatorMode && !freeTier) {
89
+ if (!creatorMode) {
94
90
  await authManager.reportUsage(toolName, creditCost, params, 200, Date.now() - startTime);
95
91
  }
96
92
 
@@ -98,7 +94,7 @@ export function makeWithAuth({ authManager, logger, metrics = null }) {
98
94
  } catch (error) {
99
95
  outcome = 'error';
100
96
  thrown = error;
101
- if (!creatorMode && !freeTier) {
97
+ if (!creatorMode) {
102
98
  await authManager.reportUsage(
103
99
  toolName,
104
100
  Math.max(1, Math.floor(creditCost * 0.5)),
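
With the `freeTier` branch removed, the error path now reports usage for every non-creator call. The amount passed to `reportUsage` on failure is half the tool cost, rounded down, with a floor of 1. The sketch below evaluates that expression for each cost tier in the 4.7.0 table; `failureCharge` is a hypothetical name for the inline expression.

```javascript
// Hypothetical name for the inline expression in the catch block.
function failureCharge(creditCost) {
  return Math.max(1, Math.floor(creditCost * 0.5));
}

// Cost tiers that appear in the 4.7.0 table.
for (const cost of [1, 2, 3, 4, 5, 8, 10]) {
  console.log(`${cost} credit tool -> ${failureCharge(cost)} reported on error`);
}
```

Because of the floor, tools costing 1 to 3 credits all report 1 credit on failure, so a failed 1-credit call reports the same amount as a successful one.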
@@ -11,6 +11,11 @@ import { htmlToMarkdown } from '../../utils/htmlToMarkdown.js'; // D3.1
11
11
 
12
12
  const ExtractContentSchema = z.object({
13
13
  url: z.string().url(),
14
+ // Pre-rendered HTML to process directly instead of fetching `url` (e.g. a
15
+ // post-action page from scrape_with_actions, or a stealth-browser render in
16
+ // deep_research). Without this field Zod stripped it and the tool always
17
+ // re-fetched the URL — silently defeating any pre-fetched-HTML caller.
18
+ html: z.string().optional(),
14
19
  options: z.object({
15
20
  // Content extraction options
16
21
  useReadability: z.boolean().default(true),
@@ -79,9 +79,10 @@ export class SearchWebTool {
79
79
  // Check for Creator Mode - allows search without API key for development/testing
80
80
  const isCreatorMode = isCreatorModeVerified();
81
81
 
82
- // Open-core Phase 2: no API key is allowed at construction time (the server
83
- // now starts in free-tier mode without one). The key requirement is
84
- // enforced at execute() time instead, so Tier-0 tools keep working.
82
+ // The server can start without a key so the MCP client can list tools, so
83
+ // construction must not throw here. Every tool is metered and the key
84
+ // requirement is enforced before execute() runs (withAuth credit check)
85
+ // and again at execute() time below.
85
86
  if (!apiKey && !isCreatorMode) {
86
87
  this.searchAdapter = null;
87
88
  this.isCreatorModeFallback = false;
@@ -127,7 +128,7 @@ export class SearchWebTool {
127
128
  }
128
129
  // --- end SearXNG short-circuit ---
129
130
 
130
- // Free-tier mode: search via the CrawlForge proxy needs an API key
131
+ // Search via the CrawlForge proxy needs an API key
131
132
  if (!this.searchAdapter) {
132
133
  throw new Error('CrawlForge API key is required for search functionality. Get one at https://www.crawlforge.dev/signup');
133
134
  }