npm - crawlforge-mcp-server - Versions diffs - 4.7.2 → 4.8.0 - Mend

crawlforge-mcp-server 4.7.2 → 4.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (50) hide show

package/CLAUDE.md +2 -2
package/package.json +2 -1
package/server.js +42 -9
package/src/cli/commands/init.js +13 -2
package/src/cli/commands/install-skills.js +10 -1
package/src/cli/commands/monitor.js +81 -0
package/src/cli/commands/uninstall-skills.js +10 -1
package/src/core/ActionExecutor.js +51 -9
package/src/core/ElicitationHelper.js +18 -5
package/src/core/LLMsTxtAnalyzer.js +2 -1
package/src/core/MonitorScheduler.js +281 -0
package/src/core/MonitorStore.js +79 -0
package/src/core/ResearchOrchestrator.js +2 -1
package/src/core/crawlers/BFSCrawler.js +2 -1
package/src/skills/agent-skills/crawlforge-batch-automation/SKILL.md +126 -0
package/src/skills/agent-skills/crawlforge-batch-automation/references/actions.md +127 -0
package/src/skills/agent-skills/crawlforge-change-tracking/SKILL.md +116 -0
package/src/skills/agent-skills/crawlforge-deep-research/SKILL.md +108 -0
package/src/skills/agent-skills/crawlforge-deep-research/references/workflows.md +76 -0
package/src/skills/agent-skills/crawlforge-getting-started/SKILL.md +89 -0
package/src/skills/agent-skills/crawlforge-getting-started/references/cli.md +71 -0
package/src/skills/agent-skills/crawlforge-getting-started/references/credits.md +75 -0
package/src/skills/agent-skills/crawlforge-stealth-browsing/SKILL.md +106 -0
package/src/skills/agent-skills/crawlforge-stealth-browsing/references/engine-selection.md +63 -0
package/src/skills/agent-skills/crawlforge-structured-extraction/SKILL.md +121 -0
package/src/skills/agent-skills/crawlforge-structured-extraction/references/templates.md +39 -0
package/src/skills/agent-skills/crawlforge-web-scraping/SKILL.md +141 -0
package/src/skills/agent-skills/crawlforge-web-scraping/references/tool-reference.md +95 -0
package/src/skills/installer.js +186 -34
package/src/tools/advanced/batchScrape/worker.js +8 -2
package/src/tools/basic/_fetch.js +14 -1
package/src/tools/crawl/_sessionContext.js +3 -1
package/src/tools/extract/_fetchAndParse.js +2 -1
package/src/tools/extract/extractContent.js +2 -1
package/src/tools/extract/processDocument.js +2 -1
package/src/tools/scrape/_brandingExtractor.js +378 -0
package/src/tools/scrape/unifiedScrape.js +66 -6
package/src/tools/templates/ScrapeTemplateTool.js +2 -1
package/src/tools/tracking/trackChanges/differ.js +3 -1
package/src/tools/tracking/trackChanges/index.js +74 -21
package/src/tools/tracking/trackChanges/schema.js +7 -2
package/src/utils/hostRateLimiter.js +46 -0
package/src/utils/robotsChecker.js +2 -1
package/src/utils/sitemapParser.js +2 -1
package/src/utils/ssrfGuard.js +161 -0
package/src/utils/ssrfProtection.js +6 -9
package/src/skills/crawlforge-cli.md +0 -157
package/src/skills/crawlforge-mcp.md +0 -80
package/src/skills/crawlforge-research.md +0 -104
package/src/skills/crawlforge-stealth.md +0 -98

package/src/skills/agent-skills/crawlforge-stealth-browsing/references/engine-selection.md ADDED Viewed

@@ -0,0 +1,63 @@
+# Stealth Engine Selection
+`stealth_mode` supports two browser engines via the `engine` parameter.
+## playwright (default)
+- Chromium-based with stealth patches applied.
+- Masks `webdriver`, User-Agent, and navigator properties.
+- Lower resource usage, faster startup.
+- Good for most sites with basic bot detection.
+## camoufox
+- Firefox-based with **native** anti-detection (no runtime patches — uses
+  Firefox's genuine properties).
+- Scores higher on CreepJS and against DataDome than patched Chromium.
+- Heavier; install with `npm install camoufox`.
+- Use for advanced fingerprinting: financial, trading, and e-commerce sites.
+## Decision table
+| Scenario | Recommended engine |
+|----------|--------------------|
+| General JS-rendered sites | playwright |
+| Basic bot-detection bypass | playwright |
+| Speed-critical scraping | playwright |
+| Cloudflare-protected sites | camoufox |
+| Sites with DataDome | camoufox |
+| Sites with PerimeterX | camoufox |
+| Financial / trading sites | camoufox |
+## stealthConfig levels
+| Level | Behavior |
+|-------|----------|
+| `basic` | Minimal fingerprint masking; lowest overhead. |
+| `medium` (default) | Balanced fingerprint randomization + header spoofing. |
+| `advanced` | Full anti-detection: human-behavior simulation, canvas/WebGL/audio/font/hardware spoofing, WebRTC blocking, timezone spoofing. |
+## Key stealthConfig fields
+- `randomizeFingerprint`, `hideWebDriver`, `blockWebRTC`, `spoofTimezone`,
+  `randomizeHeaders`, `useRandomUserAgent`, `simulateHumanBehavior`.
+- `customUserAgent`, `customViewport {width,height}`, `locale`, `timezone`.
+- `proxyRotation { enabled, proxies[], rotationInterval }`.
+- `antiDetection { cloudflareBypass, recaptchaHandling, hideAutomation,
+  spoofMediaDevices, spoofBatteryAPI }`.
+- `fingerprinting { canvasNoise, webglSpoofing, audioContextSpoofing,
+  fontSpoofing, hardwareSpoofing }`.
+## Global override
+```bash
+export CRAWLFORGE_STEALTH_ENGINE=camoufox
+```
+Forces the engine for all stealth calls regardless of the `engine` parameter.
+## Sandboxing note
+Stealth Chromium runs with `--no-sandbox` + `--disable-web-security` (a
+deliberate fingerprint-spoofing trade-off). Camoufox (Firefox) is the
+alternative when that trade-off is unacceptable.

package/src/skills/agent-skills/crawlforge-structured-extraction/SKILL.md ADDED Viewed

@@ -0,0 +1,121 @@
+---
+name: crawlforge-structured-extraction
+description: "Extracts structured JSON and analyzes content with CrawlForge's extract_structured, extract_with_llm, scrape_structured, scrape_template, process_document, analyze_content, summarize_content, and list_ollama_models tools. Use when the user wants to extract specific fields, pull data into a JSON schema, extract by natural-language prompt, scrape with CSS selectors, get product, profile, or repo data from known sites (Amazon, LinkedIn, GitHub, YouTube, Reddit, and more), parse a PDF or DOCX, summarize a page, or analyze sentiment, entities, or keywords. Defaults to local Ollama for LLM extraction; OpenAI and Anthropic optional."
+metadata:
+  version: 4.8.0
+  source: crawlforge-mcp-server
+---
+# CrawlForge Structured Extraction & Analysis
+Turn pages and documents into structured data, and run NLP analysis. Pick the
+extraction method by how predictable the page is and whether an LLM is needed.
+## Tool selection
+| You have / want | Tool | Cost |
+|-----------------|------|------|
+| A well-known site (Amazon, GitHub, LinkedIn...) | `scrape_template` | 1 |
+| Exact CSS selectors for the fields | `scrape_structured` | 2 |
+| A JSON schema to fill (LLM, CSS fallback) | `extract_structured` | 3 |
+| A natural-language extraction instruction | `extract_with_llm` | 3 |
+| A PDF / DOCX / TXT to parse | `process_document` | 2 |
+| A summary of long text | `summarize_content` | 4 |
+| Sentiment / entities / keywords / readability | `analyze_content` | 3 |
+| To list local LLMs available for extraction | `list_ollama_models` | 1 |
+Cheapest-first rule: try `scrape_template` → `scrape_structured` (deterministic)
+before reaching for the LLM tools.
+## scrape_template — known sites, zero selectors (cost: 1)
+```json
+{ "tool": "scrape_template", "params": { "template": "github-repo", "url": "https://github.com/user/repo" } }
+```
+Pass `template:"list"` to enumerate templates. Supports `amazon-product`,
+`linkedin-profile`, `github-repo`, `youtube-video`, `tweet`, `reddit-thread`,
+`hacker-news-front-page`, `producthunt-launch`, `stackoverflow-question`,
+`npm-package`. Full field lists: [templates](references/templates.md).
+CLI: `crawlforge template github-repo https://github.com/owner/repo`.
+## scrape_structured — CSS selectors (cost: 2)
+```json
+{
+  "tool": "scrape_structured",
+  "params": { "url": "https://shop.com/products", "selectors": { "price": ".price", "name": ".product-title", "link": "a.product@href" } }
+}
+```
+Most reliable for consistent markup. Append `@attr` to a selector to read an
+attribute (e.g. `a.link@href`, `img@src`). `max_results` caps matches per field.
+## extract_structured — schema-driven (cost: 3)
+```json
+{
+  "tool": "extract_structured",
+  "params": {
+    "url": "https://jobs.example.com/post/123",
+    "schema": { "properties": { "title": { "type": "string" }, "salary": { "type": "string" } }, "required": ["title"] }
+  }
+}
+```
+Uses an LLM by default and falls back to CSS selectors when no LLM is
+configured (`fallbackToSelectors:true`). Provide `selectorHints` to guide the
+fallback. A schema with >3 required fields and no LLM configured triggers a
+confirmation prompt. CLI: `crawlforge extract <url> --schema schema.json`.
+## extract_with_llm — natural-language prompt (cost: 3)
+```json
+{
+  "tool": "extract_with_llm",
+  "params": { "url": "https://example.com/article", "prompt": "extract title, author, date, and a one-line summary", "provider": "ollama" }
+}
+```
+Defaults to a **local Ollama** model (`http://localhost:11434`, no API key).
+Run `list_ollama_models` first to see installed models, then pass one via
+`model` (e.g. `"llama3.2"`). Use `provider:"openai"` or `"anthropic"` with the
+matching key for a cloud model. CLI: `crawlforge extract <url> --prompt "..." --model llama3.2`.
+## process_document — PDF / DOCX / TXT (cost: 2)
+```json
+{ "tool": "process_document", "params": { "source": "https://example.com/report.pdf", "sourceType": "pdf_url" } }
+```
+`sourceType`: `url`, `pdf_url`, `file`, `pdf_file`. Returns structured sections,
+metadata, and word count. `options` passes through `maxPages`,
+`pageRange:{start,end}`, `password`, `outputFormat`, etc.
+## summarize_content & analyze_content
+```json
+{ "tool": "summarize_content", "params": { "text": "..long article..", "options": { "summaryLength": "short", "summaryType": "abstractive" } } }
+```
+```json
+{ "tool": "analyze_content", "params": { "text": "..article text..", "options": { "extractTopics": true, "includeSentiment": true } } }
+```
+`summarize_content` (cost 4) does extractive or abstractive summaries.
+`analyze_content` (cost 3) returns language, sentiment, entities, topics,
+keyword density, and readability. Both take raw `text` — feed them output from
+`extract_content` / `extract_text`.
+## LLM fallback chain
+For LLM-backed extraction (extract_structured, extract_with_llm, abstractive
+summarize): **Ollama (local, default) → OpenAI key → Anthropic key → MCP
+Sampling**. No cloud key is required; local Ollama is the zero-cost default.
+## Cost note
+`scrape_template` & `list_ollama_models` = 1 · `scrape_structured`,
+`process_document` = 2 · `extract_structured`, `extract_with_llm`,
+`analyze_content` = 3 · `summarize_content` = 4. Cloud LLM usage is billed
+separately by your provider, on top of credits.

package/src/skills/agent-skills/crawlforge-structured-extraction/references/templates.md ADDED Viewed

@@ -0,0 +1,39 @@
+# scrape_template — Site Templates
+Pre-built extractors for well-known sites. Call
+`scrape_template({ template: "list" })` (or CLI `crawlforge template --list`)
+to enumerate at runtime. Cost: 1 credit per call.
+| Template ID | Target | Typical fields returned |
+|-------------|--------|-------------------------|
+| `amazon-product` | Amazon product page (`/dp/...`) | title, price, rating, review count, availability, images, features |
+| `linkedin-profile` | LinkedIn public profile | name, headline, location, current role, about, experience |
+| `github-repo` | GitHub repository | name, owner, description, stars, forks, language, topics, README excerpt |
+| `youtube-video` | YouTube watch page | title, channel, views, likes, published date, description |
+| `tweet` | A single tweet/X post | author, handle, text, timestamp, likes, reposts |
+| `reddit-thread` | Reddit thread | title, subreddit, author, score, top comments |
+| `hacker-news-front-page` | HN front page | ranked stories: title, URL, points, author, comment count |
+| `producthunt-launch` | Product Hunt launch | name, tagline, upvotes, maker, description |
+| `stackoverflow-question` | Stack Overflow question | title, tags, votes, question body, accepted answer |
+| `npm-package` | npm package page | name, version, description, weekly downloads, license, repo link |
+## Usage
+```json
+{ "tool": "scrape_template", "params": { "template": "amazon-product", "url": "https://amazon.com/dp/B0XXXXX" } }
+```
+Parameters:
+| Param | Type | Default | Notes |
+|-------|------|---------|-------|
+| `template` | string | — | A template ID above, or `list`. |
+| `url` | string (URL) | — | Required unless `template` is `list`. |
+| `timeout` | number | `15000` | 5000–60000 ms. |
+## When a template does not exist for the site
+Fall back, cheapest first:
+1. `scrape_structured` with your own CSS selectors (cost 2).
+2. `extract_structured` with a JSON schema (cost 3, LLM with CSS fallback).
+3. `extract_with_llm` with a natural-language prompt (cost 3).

package/src/skills/agent-skills/crawlforge-web-scraping/SKILL.md ADDED Viewed

@@ -0,0 +1,141 @@
+---
+name: crawlforge-web-scraping
+description: "Scrapes web pages and returns clean Markdown, HTML, plain text, links, or metadata using CrawlForge's scrape, fetch_url, extract_content, extract_text, extract_links, extract_metadata, map_site, and crawl_deep tools. Use when the user wants to scrape a URL, fetch a page, get the markdown or text of a website, extract links or metadata, read an article, map a site's URLs, generate a sitemap, or crawl a whole website or documentation site. Prefer the unified scrape tool to get multiple formats in one call; use map_site to discover URLs and crawl_deep to walk an entire site."
+metadata:
+  version: 4.8.0
+  source: crawlforge-mcp-server
+---
+# CrawlForge Web Scraping
+Fetch and clean web content with the CrawlForge MCP server. This skill covers
+single-page scraping (one URL → Markdown / HTML / text / links / metadata),
+site discovery (sitemaps and URL maps), and whole-site crawling.
+## When to use
+- "Scrape this URL" / "get the markdown of this page" / "read this article" → `scrape`
+- "Just give me the raw HTML / JSON / headers" → `fetch_url`
+- "Give me the clean article text, no nav or ads" → `extract_content`
+- "Get the plain text / markdown body" → `extract_text`
+- "List every link on this page" → `extract_links`
+- "Get the title / meta / Open Graph / SEO tags" → `extract_metadata`
+- "Map all the URLs on this site" / "generate a sitemap" → `map_site`
+- "Crawl the whole docs site" / "index every page" → `crawl_deep`
+## Tool selection (fastest path first)
+1. **`scrape`** is the default for single pages. One fetch returns many formats
+   at once — no N-request fan-out. Ask for exactly the formats you need.
+2. Use the **single-purpose basic tools** (`fetch_url`, `extract_text`,
+   `extract_links`, `extract_metadata`) when you only want one cheap thing.
+3. **`extract_content`** when you specifically want Readability-cleaned article
+   body (strips ads, nav, footers) — better than `extract_text` for articles.
+4. **`map_site`** before a crawl to estimate scope / find section URLs.
+5. **`crawl_deep`** to walk an entire site and (optionally) extract content.
+If a page returns 403/429, a CAPTCHA, or empty "enable JavaScript" content,
+switch to the **crawlforge-stealth-browsing** skill (`stealth_mode`).
+## scrape — unified multi-format (cost: 2)
+Get markdown + links + metadata in a single call:
+```json
+{
+  "tool": "scrape",
+  "params": {
+    "url": "https://example.com/article",
+    "formats": ["markdown", "links", "metadata"],
+    "onlyMainContent": true
+  }
+}
+```
+`formats` accepts: `"markdown"`, `"html"`, `"rawHtml"`, `"text"`, `"links"`,
+`"metadata"`, `"screenshot"`, or an object `{ "type": "json", "schema": {...},
+"prompt": "..." }` for LLM-structured extraction. Partial success is supported:
+a failing format adds a `warnings[]` entry instead of failing the whole call.
+`onlyMainContent` (default `true`) strips boilerplate via Readability.
+CLI equivalent:
+```bash
+crawlforge scrape https://example.com/article --extract --format markdown
+```
+## fetch_url — raw HTTP (cost: 1)
+```json
+{ "tool": "fetch_url", "params": { "url": "https://example.com", "timeout": 15000 } }
+```
+Returns headers + body. Good for JSON/XML APIs or as a first step before
+`extract_text` / `extract_content`. Supports `headers` (e.g. auth tokens).
+## extract_content vs extract_text
+```json
+{ "tool": "extract_content", "params": { "url": "https://blog.example.com/post" } }
+```
+```json
+{ "tool": "extract_text", "params": { "url": "https://example.com/article", "output_format": "markdown" } }
+```
+`extract_content` (cost 2) = Readability-cleaned article body, best for RAG.
+`extract_text` (cost 1) = faster, strips tags; pass `output_format:"markdown"`
+for RAG-friendly output.
+## extract_links / extract_metadata (cost: 1 each)
+```json
+{ "tool": "extract_links", "params": { "url": "https://example.com", "filter_external": true } }
+```
+```json
+{ "tool": "extract_metadata", "params": { "url": "https://example.com" } }
+```
+## map_site — discover URLs (cost: 2)
+```json
+{
+  "tool": "map_site",
+  "params": { "url": "https://example.com", "include_sitemap": true, "max_urls": 500 }
+}
+```
+Reads `sitemap.xml` when available. Pass `search:"pricing"` to rank discovered
+URLs by relevance and get a `ranked_urls:[{url,score}]` array (default output
+is unchanged when `search` is omitted).
+CLI: `crawlforge map https://example.com --pretty` (add `--format xml` for a sitemap file).
+## crawl_deep — walk a whole site (cost: 4, scales with pages)
+```json
+{
+  "tool": "crawl_deep",
+  "params": {
+    "url": "https://docs.example.com",
+    "max_depth": 3,
+    "max_pages": 200,
+    "extract_content": true
+  }
+}
+```
+BFS crawl up to depth 5 / 1000 pages. Use `include_patterns` /
+`exclude_patterns` (regex) to scope, `respect_robots` to honor robots.txt
+(on by default). Crawls projected over ~500 pages trigger a confirmation
+prompt (elicitation). CLI: `crawlforge crawl https://docs.example.com --depth 3 --max-pages 200`.
+## Cost note
+Cheapest first: `fetch_url`, `extract_text`, `extract_links`,
+`extract_metadata` = 1 credit · `scrape`, `extract_content`, `map_site` =
+2 · `crawl_deep` = 4 (grows with `max_pages`). Prefer one `scrape` call over
+several single-format calls when you need multiple formats.
+See [tool reference](references/tool-reference.md) for the full parameter list.

package/src/skills/agent-skills/crawlforge-web-scraping/references/tool-reference.md ADDED Viewed

@@ -0,0 +1,95 @@
+# Web Scraping & Crawl Tools — Parameter Reference
+Authoritative parameters for the scraping/discovery/crawl tools. Costs are in
+CrawlForge credits.
+## scrape (cost: 2)
+| Param | Type | Default | Notes |
+|-------|------|---------|-------|
+| `url` | string (URL) | — | Required. |
+| `formats` | array | `["markdown"]` | Any of `markdown`, `html`, `rawHtml`, `text`, `links`, `metadata`, `screenshot`, or `{type:"json", schema?, prompt?}`. |
+| `onlyMainContent` | boolean | `true` | Strip boilerplate via Readability. |
+| `timeoutMs` | number | `15000` | 1000–60000. |
+Partial success: a format that fails produces a `warnings[]` entry rather than
+failing the whole call. `{type:"json"}` may incur external LLM cost (billed by
+your provider).
+## fetch_url (cost: 1)
+| Param | Type | Default | Notes |
+|-------|------|---------|-------|
+| `url` | string (URL) | — | Required. |
+| `headers` | object | — | Custom HTTP headers (e.g. auth tokens). |
+| `timeout` | number | `10000` | 1000–30000 ms. |
+## extract_text (cost: 1)
+| Param | Type | Default | Notes |
+|-------|------|---------|-------|
+| `url` | string (URL) | — | Required. |
+| `remove_scripts` | boolean | `true` | Strip `script` tags. |
+| `remove_styles` | boolean | `true` | Strip `style` tags. |
+| `output_format` | enum | `text` | `text` or `markdown` (use markdown for RAG). |
+## extract_links (cost: 1)
+| Param | Type | Default | Notes |
+|-------|------|---------|-------|
+| `url` | string (URL) | — | Required. |
+| `filter_external` | boolean | `false` | Only return outbound links. |
+| `base_url` | string (URL) | — | Resolve relative links against this. |
+## extract_metadata (cost: 1)
+| Param | Type | Notes |
+|-------|------|-------|
+| `url` | string (URL) | Required. Returns title, description, OG tags, canonical, schema.org. |
+## extract_content (cost: 2)
+| Param | Type | Notes |
+|-------|------|-------|
+| `url` | string (URL) | Required. Readability-cleaned article body (markdown). |
+| `options` | object | Additional extraction options. |
+## map_site (cost: 2)
+| Param | Type | Default | Notes |
+|-------|------|---------|-------|
+| `url` | string (URL) | — | Required. |
+| `include_sitemap` | boolean | — | Include `sitemap.xml` data. |
+| `max_urls` | number | — | 1–10000. |
+| `group_by_path` | boolean | — | Group URLs by path segment. |
+| `include_metadata` | boolean | — | Per-URL metadata. |
+| `domain_filter` | object | — | `whitelist`, `blacklist`, `include_patterns`, `exclude_patterns`. |
+| `search` | string | — | Rank URLs by relevance; emits `ranked_urls:[{url,score}]`. |
+## crawl_deep (cost: 4, scales with max_pages)
+| Param | Type | Default | Notes |
+|-------|------|---------|-------|
+| `url` | string (URL) | — | Required start URL. |
+| `max_depth` | number | — | 1–5. |
+| `max_pages` | number | — | 1–1000. |
+| `include_patterns` | string[] | — | Regex URL allow-list. |
+| `exclude_patterns` | string[] | — | Regex URL deny-list. |
+| `follow_external` | boolean | — | Follow off-domain links. |
+| `respect_robots` | boolean | `true`* | Honor robots.txt. |
+| `extract_content` | boolean | — | Extract page content during crawl. |
+| `content_max_length` | number | `500` | Max chars per page; sets a `truncated` flag. |
+| `concurrency` | number | — | 1–20 concurrent requests. |
+| `enable_link_analysis` | boolean | — | Compute PageRank over crawled pages. |
+| `session` | object | — | Shared cookie jar for login-then-crawl. |
+*Server default honors `RESPECT_ROBOTS_TXT`. Crawls projected over ~500 pages
+trigger an elicitation confirmation.
+## CLI quick map
+| Tool | CLI |
+|------|-----|
+| `scrape` / `fetch_url` | `crawlforge scrape <url> [--extract --format markdown]` |
+| `map_site` | `crawlforge map <url> [--format xml]` |
+| `crawl_deep` | `crawlforge crawl <url> --depth 3 --max-pages 200` |