searchfetch 1.0.2 → 3.0.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Max
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md CHANGED
@@ -1,81 +1,96 @@
1
- # SearchFetch
1
+ # SearchFetch (MCP Server)
2
2
 
3
- A fault-tolerant, stealth-enabled Model Context Protocol (MCP) server for web searching and content fetching. Built specifically for AI Agents, it bypasses Google's GDPR consent screens, Cloudflare Turnstile, and converts heavy HTML into clean, token-optimized Markdown.
3
+ A highly fault-tolerant, stealth-enabled Model Context Protocol (MCP) server for web searching and content fetching. Built specifically for AI agents (Cursor, Claude Code, OpenCode), it is designed to get past bot detection (Cloudflare Turnstile, Datadome), handles dynamically rendered SPAs such as React apps, and converts bloated HTML into token-optimized Markdown.
4
4
 
5
5
  ## Features
6
- * **Multi-Engine Search:** Natively supports DuckDuckGo and Google parsing out of the box. DuckDuckGo is set as the preferred default.
7
- * **Aggressive Base64 / Image Scrubber:** Implements "nuclear" DOM scrubbing prior to parsing. Guaranteed to NEVER pollute your LLM's context window with giant base64 image strings (`data:image/...`).
8
- * **Stealth CloakBrowser:** Avoids FingerprintJS, reCAPTCHA, and Cloudflare using Chromium C++ patches and humanized mouse movements natively.
9
- * **SPA & React Support:** Waits for network idle to ensure modern Single Page Applications fully execute JavaScript and render before extracting content.
10
- * **Fault Tolerant:** Extracts whatever DOM was successfully loaded even if a massive, clunky page times out mid-render.
11
- * **Pagination Support:** Fetches massive webpages iteratively via `start_index` and `max_length` without blowing out AI context tokens limits.
12
-
13
- ## Installation
14
-
15
- 1. Clone or copy the directory.
16
- ```bash
17
- git clone https://github.com/maxylev/searchfetch
18
- ```
19
- 2. Install dependencies:
20
- ```bash
21
- npm install
22
- ```
23
- 3. Make the main script executable:
24
- ```bash
25
- chmod +x index.js
26
- ```
27
- 4. Link it globally to your system:
28
- ```bash
29
- npm link
30
- ```
31
-
32
- ## Configuration
33
-
34
- Configure your AI tool/IDE (Cursor, Claude Desktop, Opencode, etc.) to point to this server.
35
-
36
- ### Example `config.json` (Opencode, Cursor):
6
+ * **Maximum Fault Tolerance:** Implements auto-healing browser sessions, grace-period timeouts for clunky SPAs, and network-level aborting of tracking scripts and media.
7
+ * **Stealth Engine:** Powered by CloakBrowser C++ patches + `humanize` logic. Anti-bot systems are meant to score it as a normal browser because its mouse movement and rendering behavior mimic one.
8
+ * **Nuclear Token Scrubber:** Strips Base64 images, SVGs, scripts, and inline styles out of the DOM *before* Markdown conversion, so they do not bloat your LLM's context window.
9
+ * **Dual Execution Paths:** Natively supports zero-install execution via both Python (`uvx`) and Node.js (`npx`).
10
+
11
+ ---
12
+
13
+ ## Usage & Installation
14
+
15
+ You do not need to install this repository manually. Configure your agent to use the zero-install commands `npx` or `uvx` depending on your environment.
16
+
17
+ ### Claude Desktop Configuration
18
+ Add the following to your `claude_desktop_config.json`:
19
+
20
+ **Option A: Using Python (`uvx` - Recommended)**
37
21
  ```json
38
22
  {
39
- "mcp": {
23
+ "mcpServers": {
40
24
  "searchfetch": {
41
- "type": "local",
42
- "command":["npx", "searchfetch"],
43
- "enabled": true
25
+ "command": "uvx",
26
+ "args": ["searchfetch"]
44
27
  }
45
28
  }
46
29
  }
47
30
  ```
48
31
 
49
- ### Example `claude_desktop_config.json`:
32
+ **Option B: Using Node.js (`npx`)**
50
33
  ```json
51
34
  {
52
35
  "mcpServers": {
53
36
  "searchfetch": {
54
37
  "command": "npx",
55
- "args": ["searchfetch"]
38
+ "args": ["-y", "searchfetch"]
56
39
  }
57
40
  }
58
41
  }
59
42
  ```
60
43
 
44
+ ### Cursor / IDE Configuration
45
+ Add it via the **MCP panel** in Cursor settings:
46
+ * **Type:** `command`
47
+ * **Command:** `uvx searchfetch` (or `npx -y searchfetch`)
48
+
49
+ ---
50
+
61
51
  ## Available Tools
62
52
 
63
53
  ### 1. `websearch`
64
- Searches the web via Google or DuckDuckGo and returns structured snippets.
65
- * **`query`** (string): Your search query.
66
- * **`engine`** (string): `"google"` or `"duckduckgo"` (default `"duckduckgo"`).
67
- * **`max_results`** (number): Number of results to return (default `10`).
54
+ Searches the web through the v3 template pipeline. DuckDuckGo and Google are built-in templates, and custom search templates can be selected by name.
55
+
56
+ **Parameters:**
57
+ * **`query`** *(string, required)*: The search query string.
58
+ * **`engine`** *(string, optional)*: Search engine/template to use. Can be `"duckduckgo"`, `"google"`, or a custom search template name. Default is `"duckduckgo"`.
59
+ * **`max_results`** *(number, optional)*: Maximum number of results to return. Default is `10`.
60
+ * **`region`** *(string/null, optional)*: Region and language code to localize search results.
61
+   * Examples: `"us-en"`, `"uk-en"`, `"de-de"`.
62
+   * For DuckDuckGo, the code is used as-is.
63
+   * For Google, it maps to the `gl` (country) and `hl` (language) query parameters automatically.
64
+   * `null` uses the template default.
65
+ * **`safe_search`** *(boolean/null, optional)*: Enable safe search. Maps to DuckDuckGo/Google parameters automatically; `null` uses the template default.
66
+ * **`block_media`** *(boolean, optional)*: Block images, media, and fonts at the network layer. Default is `true`.
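
The `region` handling described for Google above can be illustrated with a short sketch. It assumes a code such as `"de-de"` is split at the hyphen into a country part (`gl`) and a language part (`hl`). The helper name `google_region_params` is hypothetical and not taken from the package, and the real server may normalize codes differently.

```python
from typing import Optional


def google_region_params(region: Optional[str]) -> dict:
    """Split a "country-language" code into Google's gl/hl query parameters.

    Hypothetical helper for illustration only; not part of searchfetch.
    """
    if region is None:
        # null means "use the template default", so add no parameters.
        return {}
    country, sep, language = region.lower().partition("-")
    if not sep or not country or not language:
        raise ValueError(f"expected a code like 'us-en', got {region!r}")
    return {"gl": country, "hl": language}


print(google_region_params("de-de"))
```

Returning an empty dict for `None` mirrors the documented behavior that `null` leaves the template default untouched.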
68
67
 
69
68
  ### 2. `webfetch`
70
- Visits a URL as a stealthy human, waits for the JS to render, completely scrubs visual assets/inline styles to save tokens, and returns the markdown content.
71
- * **`url`** (string): Full HTTP/HTTPS link.
72
- * **`format`** (string): Set to `"markdown"` (default), `"clean_html"`, or `"raw_html"`.
73
- * **`start_index`** (number): Pagination offset.
74
- * **`max_length`** (number): Maximum character length to return per call (default `10000`).
75
- * **`block_media`** (boolean): Speeds up page loads by ignoring images, videos, and fonts entirely at the network layer (default `true`).
76
-
77
- ## Debugging
78
- If you want to debug JSON-RPC shapes locally:
69
+ Fetches a page with CloakBrowser and extracts structured Markdown using a named, inline, or auto-detected template. Built-ins include GitHub repositories/issues, npm, PyPI, crates.io, and ReadTheDocs-style docs pages. Unknown pages fall back to generic Markdown extraction.
70
+
71
+ **Parameters:**
72
+ * **`url`** *(string, required)*: The full URL of the webpage to fetch (must start with `http://` or `https://`).
73
+ * **`template`** *(string, optional)*: `"auto"`, a built-in template name, or an inline JSON template. Default is `"auto"`.
74
+ * **`start_index`** *(number, optional)*: Character offset to start reading from for pagination. Use this if a document is too large to fit in the context window. Default is `0`.
75
+ * **`max_length`** *(number, optional)*: Maximum characters to return per request. Default is `10000`.
76
+ * **`block_media`** *(boolean, optional)*: Block images, videos, and fonts entirely at the network layer to drastically speed up page loads and dodge tracking pixels. Default is `true`.
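
The pagination parameters above suggest a simple client-side loop: request `max_length` characters, advance `start_index` by the amount received, and stop when a chunk comes back short or empty. The sketch below assumes the tool returns plain text slices with nothing appended. `iter_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real `webfetch` call by slicing a local string.

```python
from typing import Callable, Iterator


def iter_pages(fetch: Callable[[int, int], str], max_length: int = 10000) -> Iterator[str]:
    """Yield successive chunks until the fetcher returns an empty or short chunk."""
    start_index = 0
    while True:
        chunk = fetch(start_index, max_length)
        if not chunk:
            break
        yield chunk
        if len(chunk) < max_length:
            # A short chunk is assumed to mean the end of the document.
            break
        start_index += len(chunk)


# Stand-in document and fetcher (assumptions, not the package's API).
document = "x" * 25000


def fake_fetch(start_index: int, max_length: int) -> str:
    return document[start_index:start_index + max_length]


chunks = list(iter_pages(fake_fetch))
print([len(c) for c in chunks])  # prints [10000, 10000, 5000]
```

If the real tool adds pagination notes to each response, the stop condition would need to be based on those notes instead of the chunk length.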
77
+
78
+ Template extraction supports `text`, `markdown`, `attribute`, and `html` sections; nested children; repeated sections; URL decoding transforms; per-template cookies; and per-template resource blocking.
79
+
80
+ Built-in templates live in `templates/*.json` and are shared by the Node.js and Python implementations. Each JSON file defines exactly one template, so nothing is duplicated between languages.
81
+
82
+ ---
83
+
84
+ ## Architecture & Contributions
85
+ This repository uses a flat dual-manifest file structure (`package.json` and `pyproject.toml` in the root). When committing changes, keep the logic in `index.js` and `server.py` in sync.
86
+
87
+ ### Local Development
79
88
  ```bash
80
- npm run inspector
89
+ # Node.js Testing
90
+ npm i
91
+ npm run inspector-js
92
+
93
+ # Python Testing
94
+ pip install -e .
95
+ npm run inspector-py
81
96
  ```