wet-mcp 1.0.0__tar.gz → 1.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: wet-mcp
- Version: 1.0.0
+ Version: 1.1.0
  Summary: Open-source MCP Server thay thế Tavily - Web search, extract, crawl với SearXNG
  Project-URL: Homepage, https://github.com/n24q02m/wet-mcp
  Project-URL: Repository, https://github.com/n24q02m/wet-mcp.git
@@ -21,6 +21,7 @@ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
  Requires-Python: ==3.13.*
  Requires-Dist: crawl4ai>=0.8.0
  Requires-Dist: httpx>=0.27.0
+ Requires-Dist: litellm>=1.0.0
  Requires-Dist: loguru>=0.7.0
  Requires-Dist: mcp[cli]>=1.0.0
  Requires-Dist: pydantic-settings>=2.0.0
@@ -33,27 +34,27 @@ Description-Content-Type: text/markdown
  [![PyPI version](https://badge.fury.io/py/wet-mcp.svg)](https://badge.fury.io/py/wet-mcp)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

- > **Open-source MCP Server thay thế Tavily cho web scraping & multimodal extraction**
+ > **Open-source MCP Server replacing Tavily for web scraping & multimodal extraction**

- Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup quản lý SearXNG container.
+ Zero-install experience: just `uvx wet-mcp` - automatically setups and manages SearXNG container.

  ## Features

  | Feature | Description |
  |:--------|:------------|
- | **Web Search** | Tìm kiếm qua SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
- | **Content Extract** | Trích xuất nội dung sạch (Markdown/Text/HTML) |
- | **Deep Crawl** | Đi qua nhiều trang con từ URL gốc với depth control |
- | **Site Map** | Khám phá cấu trúc URL của website |
- | **Media** | List download images, videos, audio files |
- | **Anti-bot** | Stealth mode bypass Cloudflare, Medium, LinkedIn, Twitter |
+ | **Web Search** | Search via SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
+ | **Content Extract** | Extract clean content (Markdown/Text/HTML) |
+ | **Deep Crawl** | Crawl multiple pages from a root URL with depth control |
+ | **Site Map** | Discover website URL structure |
+ | **Media** | List and download images, videos, audio files |
+ | **Anti-bot** | Stealth mode bypasses Cloudflare, Medium, LinkedIn, Twitter |

  ## Quick Start

  ### Prerequisites

  - Docker daemon running (for SearXNG)
- - Python 3.13+ (hoặc dùng uvx)
+ - Python 3.13+ (or use uvx)

  ### MCP Client Configuration

@@ -70,11 +71,11 @@ Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup và qu
  }
  ```

- **Đó tất cả!** Khi MCP client gọi wet-mcp lần đầu:
- 1. Tự động install Playwright chromium
- 2. Tự động pull SearXNG Docker image
- 3. Start `wet-searxng` container
- 4. Chạy MCP server
+ **That's it!** When the MCP client calls `wet-mcp` for the first time:
+ 1. Automatically installs Playwright chromium
+ 2. Automatically pulls SearXNG Docker image
+ 3. Starts `wet-searxng` container
+ 4. Runs the MCP server

  ### Without uvx

@@ -137,9 +138,9 @@ wet-mcp
  │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │
  │ │ web │ │ media │ │ help │ │
  │ │ (search, │ │ (list, │ │ (full documentation)│ │
- │ │ extract, │ │ download)│ └──────────────────────┘ │
- │ │ crawl, │ └────┬─────┘
- │ │ map) │
+ │ │ extract, │ │ crawl, │ └──────────────────────┘ │
+ │ │ crawl, │ │ download)│
+ │ │ map) │ └────┬─────┘
  │ └────┬─────┘ │ │
  │ │ │ │
  │ ▼ ▼ │
@@ -3,27 +3,27 @@
  [![PyPI version](https://badge.fury.io/py/wet-mcp.svg)](https://badge.fury.io/py/wet-mcp)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

- > **Open-source MCP Server thay thế Tavily cho web scraping & multimodal extraction**
+ > **Open-source MCP Server replacing Tavily for web scraping & multimodal extraction**

- Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup quản lý SearXNG container.
+ Zero-install experience: just `uvx wet-mcp` - automatically setups and manages SearXNG container.

  ## Features

  | Feature | Description |
  |:--------|:------------|
- | **Web Search** | Tìm kiếm qua SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
- | **Content Extract** | Trích xuất nội dung sạch (Markdown/Text/HTML) |
- | **Deep Crawl** | Đi qua nhiều trang con từ URL gốc với depth control |
- | **Site Map** | Khám phá cấu trúc URL của website |
- | **Media** | List download images, videos, audio files |
- | **Anti-bot** | Stealth mode bypass Cloudflare, Medium, LinkedIn, Twitter |
+ | **Web Search** | Search via SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
+ | **Content Extract** | Extract clean content (Markdown/Text/HTML) |
+ | **Deep Crawl** | Crawl multiple pages from a root URL with depth control |
+ | **Site Map** | Discover website URL structure |
+ | **Media** | List and download images, videos, audio files |
+ | **Anti-bot** | Stealth mode bypasses Cloudflare, Medium, LinkedIn, Twitter |

  ## Quick Start

  ### Prerequisites

  - Docker daemon running (for SearXNG)
- - Python 3.13+ (hoặc dùng uvx)
+ - Python 3.13+ (or use uvx)

  ### MCP Client Configuration

@@ -40,11 +40,11 @@ Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup và qu
  }
  ```

- **Đó tất cả!** Khi MCP client gọi wet-mcp lần đầu:
- 1. Tự động install Playwright chromium
- 2. Tự động pull SearXNG Docker image
- 3. Start `wet-searxng` container
- 4. Chạy MCP server
+ **That's it!** When the MCP client calls `wet-mcp` for the first time:
+ 1. Automatically installs Playwright chromium
+ 2. Automatically pulls SearXNG Docker image
+ 3. Starts `wet-searxng` container
+ 4. Runs the MCP server

  ### Without uvx

@@ -107,9 +107,9 @@ wet-mcp
  │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │
  │ │ web │ │ media │ │ help │ │
  │ │ (search, │ │ (list, │ │ (full documentation)│ │
- │ │ extract, │ │ download)│ └──────────────────────┘ │
- │ │ crawl, │ └────┬─────┘
- │ │ map) │
+ │ │ extract, │ │ crawl, │ └──────────────────────┘ │
+ │ │ crawl, │ │ download)│
+ │ │ map) │ └────┬─────┘
  │ └────┬─────┘ │ │
  │ │ │ │
  │ ▼ ▼ │
@@ -1,6 +1,6 @@
  [project]
  name = "wet-mcp"
- version = "1.0.0"
+ version = "1.1.0"
  description = "Open-source MCP Server thay thế Tavily - Web search, extract, crawl với SearXNG"
  readme = "README.md"
  license = { text = "MIT" }
@@ -32,6 +32,8 @@ dependencies = [
  "pydantic-settings>=2.0.0",
  # Logging
  "loguru>=0.7.0",
+ # LLM
+ "litellm>=1.0.0",
  ]

  [dependency-groups]
@@ -0,0 +1,75 @@
+ """Configuration settings for WET MCP Server."""
+
+ from pydantic_settings import BaseSettings
+
+
+ class Settings(BaseSettings):
+     """WET MCP Server configuration."""
+
+     # SearXNG
+     searxng_url: str = "http://localhost:8080"
+     searxng_timeout: int = 30
+
+     # Crawler
+     crawler_headless: bool = True
+     crawler_timeout: int = 60
+
+     # Docker Management
+     wet_auto_docker: bool = True
+     wet_container_name: str = "wet-searxng"
+     wet_searxng_image: str = "searxng/searxng:latest"
+     wet_searxng_port: int = 8080
+
+     # Media
+     download_dir: str = "~/.wet-mcp/downloads"
+
+     # Media Analysis (LiteLLM)
+     api_keys: str | None = None  # provider:key,provider:key
+     llm_models: str = "gemini/gemini-2.0-flash-exp"  # provider/model (fallback chain)
+
+     def setup_api_keys(self) -> dict[str, list[str]]:
+         """Parse API_KEYS and set environment variables for LiteLLM."""
+         if not self.api_keys:
+             return {}
+
+         import os
+
+         env_map = {
+             "gemini": "GOOGLE_API_KEY",
+             "openai": "OPENAI_API_KEY",
+             "anthropic": "ANTHROPIC_API_KEY",
+             "groq": "GROQ_API_KEY",
+             "deepseek": "DEEPSEEK_API_KEY",
+             "mistral": "MISTRAL_API_KEY",
+         }
+
+         keys_by_provider: dict[str, list[str]] = {}
+
+         for pair in self.api_keys.split(","):
+             pair = pair.strip()
+             if ":" not in pair:
+                 continue
+
+             provider, key = pair.split(":", 1)
+             provider = provider.strip().lower()
+             key = key.strip()
+
+             if not key:
+                 continue
+
+             keys_by_provider.setdefault(provider, []).append(key)
+
+         # Set first key of each provider as env var
+         for provider, keys in keys_by_provider.items():
+             if provider in env_map and keys:
+                 os.environ[env_map[provider]] = keys[0]
+
+         return keys_by_provider
+
+     # Logging
+     log_level: str = "INFO"
+
+     model_config = {"env_prefix": "", "case_sensitive": False}
+
+
+ settings = Settings()
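
The new `api_keys` field is read from the environment by pydantic-settings (no prefix, case-insensitive), so it is typically supplied as an `API_KEYS` variable in `provider:key,provider:key` form. Below is a minimal sketch, under that assumption, of how `setup_api_keys()` groups keys per provider and exports the first key of each to the environment variable LiteLLM expects; the key values are placeholders, not real keys.

```python
import os

from wet_mcp.config import Settings

# Placeholder keys in the documented provider:key,provider:key format (not real keys).
os.environ["API_KEYS"] = "gemini:example-key-1,gemini:example-key-2,openai:example-key-3"

settings = Settings()
keys_by_provider = settings.setup_api_keys()

# Keys are grouped by provider; the first key of each provider is exported
# under the environment variable name LiteLLM expects for that provider.
assert keys_by_provider == {
    "gemini": ["example-key-1", "example-key-2"],
    "openai": ["example-key-3"],
}
assert os.environ["GOOGLE_API_KEY"] == "example-key-1"
assert os.environ["OPENAI_API_KEY"] == "example-key-3"
```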
@@ -141,6 +141,6 @@ def remove_searxng() -> None:
  container_name = settings.wet_container_name
  if docker.container.exists(container_name):
  logger.info(f"Removing container: {container_name}")
- docker.container.remove(container_name, force=True)
+ docker.container.remove(container_name, force=True)  # type: ignore
  except Exception as e:
  logger.debug(f"Failed to remove container: {e}")
@@ -31,7 +31,8 @@ Scan a page and return media URLs with metadata.
  ---

  ### download
- Download specific media files to local storage.
+ Download specific media files to local storage for further analysis or processing.
+ Use this when you need to inspect the actual file content (e.g., sending an image to a Vision LLM).

  **Parameters:**
  - `media_urls` (required): List of media URLs to download
@@ -0,0 +1,84 @@
+ """LLM utilities for WET MCP Server using LiteLLM."""
+
+ import base64
+ import mimetypes
+ from pathlib import Path
+
+ from litellm import completion
+ from loguru import logger
+
+ from wet_mcp.config import settings
+
+
+ def get_llm_config() -> dict:
+     """Build LLM configuration with fallback."""
+     models = [m.strip() for m in settings.llm_models.split(",") if m.strip()]
+     if not models:
+         models = ["gemini/gemini-2.0-flash-exp"]
+
+     primary = models[0]
+     fallbacks = models[1:] if len(models) > 1 else None
+
+     # Temperature adjustment for reasoning models
+     # (Gemini 2/3 sometimes needs higher temp, but 1.5 is standard)
+     temperature = 0.1
+
+     return {
+         "model": primary,
+         "fallbacks": fallbacks,
+         "temperature": temperature,
+     }
+
+
+ def encode_image(image_path: str) -> str:
+     """Encode image to base64."""
+     with open(image_path, "rb") as image_file:
+         return base64.b64encode(image_file.read()).decode("utf-8")
+
+
+ async def analyze_media(
+     media_path: str, prompt: str = "Describe this image in detail."
+ ) -> str:
+     """Analyze media file using configured LLM."""
+     if not settings.api_keys:
+         return "Error: LLM analysis requires API_KEYS to be configured."
+
+     path_obj = Path(media_path)
+     if not path_obj.exists():
+         return f"Error: File not found at {media_path}"
+
+     # Determine mime type
+     mime_type, _ = mimetypes.guess_type(media_path)
+     if not mime_type or not mime_type.startswith("image/"):
+         return f"Error: Only image analysis is currently supported. Got {mime_type}"
+
+     try:
+         config = get_llm_config()
+         logger.info(f"Analyzing media with model: {config['model']}")
+
+         base64_image = encode_image(media_path)
+         data_url = f"data:{mime_type};base64,{base64_image}"
+
+         messages = [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "text", "text": prompt},
+                     {"type": "image_url", "image_url": {"url": data_url}},
+                 ],
+             }
+         ]
+
+         response = completion(
+             model=config["model"],
+             messages=messages,
+             fallbacks=config["fallbacks"],
+             temperature=config["temperature"],
+         )
+
+         content = response.choices[0].message.content
+         return content
+
+     except Exception as e:
+         logger.error(f"LLM analysis failed: {e}")
+         return f"Error analyzing media: {str(e)}"
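
In this new module, `get_llm_config()` builds a primary model plus fallback chain from `LLM_MODELS`, and `analyze_media()` guards on `API_KEYS`, checks that the path is an existing image, and sends it as a base64 `data:` URL through `litellm.completion`. A minimal calling sketch follows; the file path is illustrative, and `API_KEYS`/`LLM_MODELS` are assumed to be set in the environment before the settings module is loaded.

```python
import asyncio

from wet_mcp.config import settings
from wet_mcp.llm import analyze_media


async def main() -> None:
    # Export provider keys (e.g., GOOGLE_API_KEY) for LiteLLM before any completion call.
    settings.setup_api_keys()

    # Hypothetical local file previously saved by the media tool's download action.
    description = await analyze_media(
        media_path="/tmp/wet-mcp-downloads/photo.jpg",
        prompt="Describe this image in detail.",
    )
    print(description)


asyncio.run(main())
```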
@@ -106,11 +106,17 @@ async def media(
  media_urls: list[str] | None = None,
  output_dir: str | None = None,
  max_items: int = 10,
+ prompt: str = "Describe this image in detail.",
  ) -> str:
  """Media discovery and download.
  - list: Scan page, return URLs + metadata
  - download: Download specific files to local
- MCP client decides whether to analyze media.
+ - analyze: Analyze a local media file using configured LLM (requires API_KEYS)
+
+ Note: Downloading is intended for downstream analysis (e.g., passing to an LLM
+ or vision model). The MCP server provides the raw files; the MCP client
+ orchestrates the analysis.
+
  Use `help` tool for full documentation.
  """
  from wet_mcp.sources.crawler import download_media, list_media
@@ -133,8 +139,16 @@
  output_dir=output_dir or settings.download_dir,
  )

+ case "analyze":
+ if not url:
+ return "Error: url (local path) is required for analyze action"
+
+ from wet_mcp.llm import analyze_media
+
+ return await analyze_media(media_path=url, prompt=prompt)
+
  case _:
- return f"Error: Unknown action '{action}'. Valid actions: list, download"
+ return f"Error: Unknown action '{action}'. Valid actions: list, download, analyze"


  @mcp.tool()
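
With the new `analyze` branch, a typical sequence is: `list` to discover media, `download` to save files locally, then `analyze` on a saved path, reusing the tool's `url` parameter as the local path. A hypothetical pair of argument payloads an MCP client might send to the `media` tool is sketched below; the URL and path are illustrative only.

```python
# Step 1: download a media file found earlier via action="list".
download_args = {
    "action": "download",
    "media_urls": ["https://example.com/images/photo.jpg"],
}

# Step 2: analyze the locally saved file (path comes from the download result;
# the default download directory is ~/.wet-mcp/downloads).
analyze_args = {
    "action": "analyze",
    "url": "/home/user/.wet-mcp/downloads/photo.jpg",
    "prompt": "Describe this image in detail.",
}
```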
@@ -160,6 +174,9 @@ def main() -> None:
  # Run auto-setup on first start (installs Playwright, etc.)
  run_auto_setup()

+ # Setup LLM API Keys
+ settings.setup_api_keys()
+
  # Initialize SearXNG container
  searxng_url = _get_searxng_url()
  logger.info(f"SearXNG URL: {searxng_url}")
@@ -73,7 +73,9 @@ def run_auto_setup() -> bool:
  if result.returncode == 0:
  logger.debug(f"Docker available: v{result.stdout.strip()}")
  else:
- logger.info("Docker not running, will use external SearXNG URL if configured")
+ logger.info(
+ "Docker not running, will use external SearXNG URL if configured"
+ )
  except FileNotFoundError:
  logger.info("Docker not installed, will use external SearXNG URL if configured")
  except subprocess.TimeoutExpired:
@@ -41,28 +41,36 @@ async def extract(
  )

  if result.success:
- content = result.markdown if format == "markdown" else result.cleaned_html
- results.append({
- "url": url,
- "title": result.metadata.get("title", ""),
- "content": content,
- "links": {
- "internal": result.links.get("internal", [])[:20],
- "external": result.links.get("external", [])[:20],
- },
- })
+ content = (
+ result.markdown if format == "markdown" else result.cleaned_html
+ )
+ results.append(
+ {
+ "url": url,
+ "title": result.metadata.get("title", ""),
+ "content": content,
+ "links": {
+ "internal": result.links.get("internal", [])[:20],
+ "external": result.links.get("external", [])[:20],
+ },
+ }
+ )
  else:
- results.append({
- "url": url,
- "error": result.error_message or "Failed to extract",
- })
+ results.append(
+ {
+ "url": url,
+ "error": result.error_message or "Failed to extract",
+ }
+ )

  except Exception as e:
  logger.error(f"Error extracting {url}: {e}")
- results.append({
- "url": url,
- "error": str(e),
- })
+ results.append(
+ {
+ "url": url,
+ "error": str(e),
+ }
+ )

  logger.info(f"Extracted {len(results)} pages")
  return json.dumps(results, ensure_ascii=False, indent=2)
@@ -118,20 +126,30 @@
  )

  if result.success:
- content = result.markdown if format == "markdown" else result.cleaned_html
- all_results.append({
- "url": url,
- "depth": current_depth,
- "title": result.metadata.get("title", ""),
- "content": content[:5000], # Limit content size
- })
+ content = (
+ result.markdown
+ if format == "markdown"
+ else result.cleaned_html
+ )
+ all_results.append(
+ {
+ "url": url,
+ "depth": current_depth,
+ "title": result.metadata.get("title", ""),
+ "content": content[:5000], # Limit content size
+ }
+ )

  # Add internal links for next depth
  if current_depth < depth:
  internal_links = result.links.get("internal", [])
  for link_item in internal_links[:10]:
  # Crawl4AI returns dicts with 'href' key
- link_url = link_item.get("href", "") if isinstance(link_item, dict) else link_item
+ link_url = (
+ link_item.get("href", "")
+ if isinstance(link_item, dict)
+ else link_item
+ )
  if link_url and link_url not in visited:
  to_crawl.append((link_url, current_depth + 1))

@@ -279,18 +297,22 @@ async def download_media(

  filepath.write_bytes(response.content)

- results.append({
- "url": url,
- "path": str(filepath),
- "size": len(response.content),
- })
+ results.append(
+ {
+ "url": url,
+ "path": str(filepath),
+ "size": len(response.content),
+ }
+ )

  except Exception as e:
  logger.error(f"Error downloading {url}: {e}")
- results.append({
- "url": url,
- "error": str(e),
- })
+ results.append(
+ {
+ "url": url,
+ "error": str(e),
+ }
+ )

  logger.info(f"Downloaded {len([r for r in results if 'path' in r])} files")
  return json.dumps(results, ensure_ascii=False, indent=2)
@@ -45,12 +45,14 @@ async def search(
  # Format results
  formatted = []
  for r in results:
- formatted.append({
- "url": r.get("url", ""),
- "title": r.get("title", ""),
- "snippet": r.get("content", ""),
- "source": r.get("engine", ""),
- })
+ formatted.append(
+ {
+ "url": r.get("url", ""),
+ "title": r.get("title", ""),
+ "snippet": r.get("content", ""),
+ "source": r.get("engine", ""),
+ }
+ )

  output = {
  "results": formatted,
@@ -1,32 +0,0 @@
- """Configuration settings for WET MCP Server."""
-
- from pydantic_settings import BaseSettings
-
-
- class Settings(BaseSettings):
-     """WET MCP Server configuration."""
-
-     # SearXNG
-     searxng_url: str = "http://localhost:8080"
-     searxng_timeout: int = 30
-
-     # Crawler
-     crawler_headless: bool = True
-     crawler_timeout: int = 60
-
-     # Docker Management
-     wet_auto_docker: bool = True
-     wet_container_name: str = "wet-searxng"
-     wet_searxng_image: str = "searxng/searxng:latest"
-     wet_searxng_port: int = 8080
-
-     # Media
-     download_dir: str = "~/.wet-mcp/downloads"
-
-     # Logging
-     log_level: str = "INFO"
-
-     model_config = {"env_prefix": "", "case_sensitive": False}
-
-
- settings = Settings()
5 files without changes