wet-mcp 1.0.0__tar.gz → 1.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: wet-mcp
- Version: 1.0.0
+ Version: 1.1.0
  Summary: Open-source MCP Server thay thế Tavily - Web search, extract, crawl với SearXNG
  Project-URL: Homepage, https://github.com/n24q02m/wet-mcp
  Project-URL: Repository, https://github.com/n24q02m/wet-mcp.git
@@ -21,6 +21,7 @@ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
  Requires-Python: ==3.13.*
  Requires-Dist: crawl4ai>=0.8.0
  Requires-Dist: httpx>=0.27.0
+ Requires-Dist: litellm>=1.0.0
  Requires-Dist: loguru>=0.7.0
  Requires-Dist: mcp[cli]>=1.0.0
  Requires-Dist: pydantic-settings>=2.0.0
@@ -33,27 +34,27 @@ Description-Content-Type: text/markdown
  [![PyPI version](https://badge.fury.io/py/wet-mcp.svg)](https://badge.fury.io/py/wet-mcp)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

- > **Open-source MCP Server thay thế Tavily cho web scraping & multimodal extraction**
+ > **Open-source MCP Server replacing Tavily for web scraping & multimodal extraction**

- Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup quản lý SearXNG container.
+ Zero-install experience: just `uvx wet-mcp` - automatically setups and manages SearXNG container.

  ## Features

  | Feature | Description |
  |:--------|:------------|
- | **Web Search** | Tìm kiếm qua SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
- | **Content Extract** | Trích xuất nội dung sạch (Markdown/Text/HTML) |
- | **Deep Crawl** | Đi qua nhiều trang con từ URL gốc với depth control |
- | **Site Map** | Khám phá cấu trúc URL của website |
- | **Media** | List download images, videos, audio files |
- | **Anti-bot** | Stealth mode bypass Cloudflare, Medium, LinkedIn, Twitter |
+ | **Web Search** | Search via SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
+ | **Content Extract** | Extract clean content (Markdown/Text/HTML) |
+ | **Deep Crawl** | Crawl multiple pages from a root URL with depth control |
+ | **Site Map** | Discover website URL structure |
+ | **Media** | List and download images, videos, audio files |
+ | **Anti-bot** | Stealth mode bypasses Cloudflare, Medium, LinkedIn, Twitter |

  ## Quick Start

  ### Prerequisites

  - Docker daemon running (for SearXNG)
- - Python 3.13+ (hoặc dùng uvx)
+ - Python 3.13+ (or use uvx)

  ### MCP Client Configuration

@@ -70,11 +71,11 @@ Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup và qu
  }
  ```

- **Đó tất cả!** Khi MCP client gọi wet-mcp lần đầu:
- 1. Tự động install Playwright chromium
- 2. Tự động pull SearXNG Docker image
- 3. Start `wet-searxng` container
- 4. Chạy MCP server
+ **That's it!** When the MCP client calls `wet-mcp` for the first time:
+ 1. Automatically installs Playwright chromium
+ 2. Automatically pulls SearXNG Docker image
+ 3. Starts `wet-searxng` container
+ 4. Runs the MCP server

  ### Without uvx

@@ -137,9 +138,9 @@ wet-mcp
  │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │
  │ │ web │ │ media │ │ help │ │
  │ │ (search, │ │ (list, │ │ (full documentation)│ │
- │ │ extract, │ │ download)│ └──────────────────────┘ │
- │ │ crawl, │ └────┬─────┘
- │ │ map) │
+ │ │ extract, │ │ crawl, │ └──────────────────────┘ │
+ │ │ crawl, │ │ download)│
+ │ │ map) │ └────┬─────┘
  │ └────┬─────┘ │ │
  │ │ │ │
  │ ▼ ▼ │
@@ -3,27 +3,27 @@
  [![PyPI version](https://badge.fury.io/py/wet-mcp.svg)](https://badge.fury.io/py/wet-mcp)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

- > **Open-source MCP Server thay thế Tavily cho web scraping & multimodal extraction**
+ > **Open-source MCP Server replacing Tavily for web scraping & multimodal extraction**

- Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup quản lý SearXNG container.
+ Zero-install experience: just `uvx wet-mcp` - automatically setups and manages SearXNG container.

  ## Features

  | Feature | Description |
  |:--------|:------------|
- | **Web Search** | Tìm kiếm qua SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
- | **Content Extract** | Trích xuất nội dung sạch (Markdown/Text/HTML) |
- | **Deep Crawl** | Đi qua nhiều trang con từ URL gốc với depth control |
- | **Site Map** | Khám phá cấu trúc URL của website |
- | **Media** | List download images, videos, audio files |
- | **Anti-bot** | Stealth mode bypass Cloudflare, Medium, LinkedIn, Twitter |
+ | **Web Search** | Search via SearXNG (metasearch: Google, Bing, DuckDuckGo, Brave) |
+ | **Content Extract** | Extract clean content (Markdown/Text/HTML) |
+ | **Deep Crawl** | Crawl multiple pages from a root URL with depth control |
+ | **Site Map** | Discover website URL structure |
+ | **Media** | List and download images, videos, audio files |
+ | **Anti-bot** | Stealth mode bypasses Cloudflare, Medium, LinkedIn, Twitter |

  ## Quick Start

  ### Prerequisites

  - Docker daemon running (for SearXNG)
- - Python 3.13+ (hoặc dùng uvx)
+ - Python 3.13+ (or use uvx)

  ### MCP Client Configuration

@@ -40,11 +40,11 @@ Zero-install experience: chỉ cần `uvx wet-mcp` - tự động setup và qu
  }
  ```

- **Đó tất cả!** Khi MCP client gọi wet-mcp lần đầu:
- 1. Tự động install Playwright chromium
- 2. Tự động pull SearXNG Docker image
- 3. Start `wet-searxng` container
- 4. Chạy MCP server
+ **That's it!** When the MCP client calls `wet-mcp` for the first time:
+ 1. Automatically installs Playwright chromium
+ 2. Automatically pulls SearXNG Docker image
+ 3. Starts `wet-searxng` container
+ 4. Runs the MCP server

  ### Without uvx

@@ -107,9 +107,9 @@ wet-mcp
  │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │
  │ │ web │ │ media │ │ help │ │
  │ │ (search, │ │ (list, │ │ (full documentation)│ │
- │ │ extract, │ │ download)│ └──────────────────────┘ │
- │ │ crawl, │ └────┬─────┘
- │ │ map) │
+ │ │ extract, │ │ crawl, │ └──────────────────────┘ │
+ │ │ crawl, │ │ download)│
+ │ │ map) │ └────┬─────┘
  │ └────┬─────┘ │ │
  │ │ │ │
  │ ▼ ▼ │
@@ -1,6 +1,6 @@
  [project]
  name = "wet-mcp"
- version = "1.0.0"
+ version = "1.1.0"
  description = "Open-source MCP Server thay thế Tavily - Web search, extract, crawl với SearXNG"
  readme = "README.md"
  license = { text = "MIT" }
@@ -32,6 +32,8 @@ dependencies = [
  "pydantic-settings>=2.0.0",
  # Logging
  "loguru>=0.7.0",
+ # LLM
+ "litellm>=1.0.0",
  ]

  [dependency-groups]
@@ -0,0 +1,75 @@
+ """Configuration settings for WET MCP Server."""
+
+ from pydantic_settings import BaseSettings
+
+
+ class Settings(BaseSettings):
+     """WET MCP Server configuration."""
+
+     # SearXNG
+     searxng_url: str = "http://localhost:8080"
+     searxng_timeout: int = 30
+
+     # Crawler
+     crawler_headless: bool = True
+     crawler_timeout: int = 60
+
+     # Docker Management
+     wet_auto_docker: bool = True
+     wet_container_name: str = "wet-searxng"
+     wet_searxng_image: str = "searxng/searxng:latest"
+     wet_searxng_port: int = 8080
+
+     # Media
+     download_dir: str = "~/.wet-mcp/downloads"
+
+     # Media Analysis (LiteLLM)
+     api_keys: str | None = None  # provider:key,provider:key
+     llm_models: str = "gemini/gemini-2.0-flash-exp"  # provider/model (fallback chain)
+
+     def setup_api_keys(self) -> dict[str, list[str]]:
+         """Parse API_KEYS and set environment variables for LiteLLM."""
+         if not self.api_keys:
+             return {}
+
+         import os
+
+         env_map = {
+             "gemini": "GOOGLE_API_KEY",
+             "openai": "OPENAI_API_KEY",
+             "anthropic": "ANTHROPIC_API_KEY",
+             "groq": "GROQ_API_KEY",
+             "deepseek": "DEEPSEEK_API_KEY",
+             "mistral": "MISTRAL_API_KEY",
+         }
+
+         keys_by_provider: dict[str, list[str]] = {}
+
+         for pair in self.api_keys.split(","):
+             pair = pair.strip()
+             if ":" not in pair:
+                 continue
+
+             provider, key = pair.split(":", 1)
+             provider = provider.strip().lower()
+             key = key.strip()
+
+             if not key:
+                 continue
+
+             keys_by_provider.setdefault(provider, []).append(key)
+
+         # Set first key of each provider as env var
+         for provider, keys in keys_by_provider.items():
+             if provider in env_map and keys:
+                 os.environ[env_map[provider]] = keys[0]
+
+         return keys_by_provider
+
+     # Logging
+     log_level: str = "INFO"
+
+     model_config = {"env_prefix": "", "case_sensitive": False}
+
+
+ settings = Settings()
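
The new `api_keys` field is read from the environment by pydantic-settings (no prefix, case-insensitive), so it is typically supplied as an `API_KEYS` variable in `provider:key,provider:key` form. Below is a minimal sketch, under that assumption, of how `setup_api_keys()` groups keys per provider and exports the first key of each to the environment variable LiteLLM expects; the key values are placeholders, not real keys.

```python
import os

from wet_mcp.config import Settings

# Placeholder keys in the documented provider:key,provider:key format (not real keys).
os.environ["API_KEYS"] = "gemini:example-key-1,gemini:example-key-2,openai:example-key-3"

settings = Settings()
keys_by_provider = settings.setup_api_keys()

# Keys are grouped by provider; the first key of each provider is exported
# under the environment variable name LiteLLM expects for that provider.
assert keys_by_provider == {
    "gemini": ["example-key-1", "example-key-2"],
    "openai": ["example-key-3"],
}
assert os.environ["GOOGLE_API_KEY"] == "example-key-1"
assert os.environ["OPENAI_API_KEY"] == "example-key-3"
```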
@@ -141,6 +141,6 @@ def remove_searxng() -> None:
  container_name = settings.wet_container_name
  if docker.container.exists(container_name):
  logger.info(f"Removing container: {container_name}")
- docker.container.remove(container_name, force=True)
+ docker.container.remove(container_name, force=True)  # type: ignore
  except Exception as e:
  logger.debug(f"Failed to remove container: {e}")
@@ -31,7 +31,8 @@ Scan a page and return media URLs with metadata.
  ---

  ### download
- Download specific media files to local storage.
+ Download specific media files to local storage for further analysis or processing.
+ Use this when you need to inspect the actual file content (e.g., sending an image to a Vision LLM).

  **Parameters:**
  - `media_urls` (required): List of media URLs to download
@@ -0,0 +1,84 @@
+ """LLM utilities for WET MCP Server using LiteLLM."""
+
+ import base64
+ import mimetypes
+ from pathlib import Path
+
+ from litellm import completion
+ from loguru import logger
+
+ from wet_mcp.config import settings
+
+
+ def get_llm_config() -> dict:
+     """Build LLM configuration with fallback."""
+     models = [m.strip() for m in settings.llm_models.split(",") if m.strip()]
+     if not models:
+         models = ["gemini/gemini-2.0-flash-exp"]
+
+     primary = models[0]
+     fallbacks = models[1:] if len(models) > 1 else None
+
+     # Temperature adjustment for reasoning models
+     # (Gemini 2/3 sometimes needs higher temp, but 1.5 is standard)
+     temperature = 0.1
+
+     return {
+         "model": primary,
+         "fallbacks": fallbacks,
+         "temperature": temperature,
+     }
+
+
+ def encode_image(image_path: str) -> str:
+     """Encode image to base64."""
+     with open(image_path, "rb") as image_file:
+         return base64.b64encode(image_file.read()).decode("utf-8")
+
+
+ async def analyze_media(
+     media_path: str, prompt: str = "Describe this image in detail."
+ ) -> str:
+     """Analyze media file using configured LLM."""
+     if not settings.api_keys:
+         return "Error: LLM analysis requires API_KEYS to be configured."
+
+     path_obj = Path(media_path)
+     if not path_obj.exists():
+         return f"Error: File not found at {media_path}"
+
+     # Determine mime type
+     mime_type, _ = mimetypes.guess_type(media_path)
+     if not mime_type or not mime_type.startswith("image/"):
+         return f"Error: Only image analysis is currently supported. Got {mime_type}"
+
+     try:
+         config = get_llm_config()
+         logger.info(f"Analyzing media with model: {config['model']}")
+
+         base64_image = encode_image(media_path)
+         data_url = f"data:{mime_type};base64,{base64_image}"
+
+         messages = [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "text", "text": prompt},
+                     {"type": "image_url", "image_url": {"url": data_url}},
+                 ],
+             }
+         ]
+
+         response = completion(
+             model=config["model"],
+             messages=messages,
+             fallbacks=config["fallbacks"],
+             temperature=config["temperature"],
+         )
+
+         content = response.choices[0].message.content
+         return content
+
+     except Exception as e:
+         logger.error(f"LLM analysis failed: {e}")
+         return f"Error analyzing media: {str(e)}"
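
In this new module, `get_llm_config()` builds a primary model plus fallback chain from `LLM_MODELS`, and `analyze_media()` guards on `API_KEYS`, checks that the path is an existing image, and sends it as a base64 `data:` URL through `litellm.completion`. A minimal calling sketch follows; the file path is illustrative, and `API_KEYS`/`LLM_MODELS` are assumed to be set in the environment before the settings module is loaded.

```python
import asyncio

from wet_mcp.config import settings
from wet_mcp.llm import analyze_media


async def main() -> None:
    # Export provider keys (e.g., GOOGLE_API_KEY) for LiteLLM before any completion call.
    settings.setup_api_keys()

    # Hypothetical local file previously saved by the media tool's download action.
    description = await analyze_media(
        media_path="/tmp/wet-mcp-downloads/photo.jpg",
        prompt="Describe this image in detail.",
    )
    print(description)


asyncio.run(main())
```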
@@ -106,11 +106,17 @@ async def media(
  media_urls: list[str] | None = None,
  output_dir: str | None = None,
  max_items: int = 10,
+ prompt: str = "Describe this image in detail.",
  ) -> str:
  """Media discovery and download.
  - list: Scan page, return URLs + metadata
  - download: Download specific files to local
- MCP client decides whether to analyze media.
+ - analyze: Analyze a local media file using configured LLM (requires API_KEYS)
+
+ Note: Downloading is intended for downstream analysis (e.g., passing to an LLM
+ or vision model). The MCP server provides the raw files; the MCP client
+ orchestrates the analysis.
+
  Use `help` tool for full documentation.
  """
  from wet_mcp.sources.crawler import download_media, list_media
@@ -133,8 +139,16 @@
  output_dir=output_dir or settings.download_dir,
  )

+ case "analyze":
+ if not url:
+ return "Error: url (local path) is required for analyze action"
+
+ from wet_mcp.llm import analyze_media
+
+ return await analyze_media(media_path=url, prompt=prompt)
+
  case _:
- return f"Error: Unknown action '{action}'. Valid actions: list, download"
+ return f"Error: Unknown action '{action}'. Valid actions: list, download, analyze"


  @mcp.tool()
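
With the new `analyze` branch, a typical sequence is: `list` to discover media, `download` to save files locally, then `analyze` on a saved path, reusing the tool's `url` parameter as the local path. A hypothetical pair of argument payloads an MCP client might send to the `media` tool is sketched below; the URL and path are illustrative only.

```python
# Step 1: download a media file found earlier via action="list".
download_args = {
    "action": "download",
    "media_urls": ["https://example.com/images/photo.jpg"],
}

# Step 2: analyze the locally saved file (path comes from the download result;
# the default download directory is ~/.wet-mcp/downloads).
analyze_args = {
    "action": "analyze",
    "url": "/home/user/.wet-mcp/downloads/photo.jpg",
    "prompt": "Describe this image in detail.",
}
```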
@@ -160,6 +174,9 @@ def main() -> None:
  # Run auto-setup on first start (installs Playwright, etc.)
  run_auto_setup()

+ # Setup LLM API Keys
+ settings.setup_api_keys()
+
  # Initialize SearXNG container
  searxng_url = _get_searxng_url()
  logger.info(f"SearXNG URL: {searxng_url}")
@@ -73,7 +73,9 @@ def run_auto_setup() -> bool:
  if result.returncode == 0:
  logger.debug(f"Docker available: v{result.stdout.strip()}")
  else:
- logger.info("Docker not running, will use external SearXNG URL if configured")
+ logger.info(
+ "Docker not running, will use external SearXNG URL if configured"
+ )
  except FileNotFoundError:
  logger.info("Docker not installed, will use external SearXNG URL if configured")
  except subprocess.TimeoutExpired:
@@ -41,28 +41,36 @@ async def extract(
  )

  if result.success:
- content = result.markdown if format == "markdown" else result.cleaned_html
- results.append({
- "url": url,
- "title": result.metadata.get("title", ""),
- "content": content,
- "links": {
- "internal": result.links.get("internal", [])[:20],
- "external": result.links.get("external", [])[:20],
- },
- })
+ content = (
+ result.markdown if format == "markdown" else result.cleaned_html
+ )
+ results.append(
+ {
+ "url": url,
+ "title": result.metadata.get("title", ""),
+ "content": content,
+ "links": {
+ "internal": result.links.get("internal", [])[:20],
+ "external": result.links.get("external", [])[:20],
+ },
+ }
+ )
  else:
- results.append({
- "url": url,
- "error": result.error_message or "Failed to extract",
- })
+ results.append(
+ {
+ "url": url,
+ "error": result.error_message or "Failed to extract",
+ }
+ )

  except Exception as e:
  logger.error(f"Error extracting {url}: {e}")
- results.append({
- "url": url,
- "error": str(e),
- })
+ results.append(
+ {
+ "url": url,
+ "error": str(e),
+ }
+ )

  logger.info(f"Extracted {len(results)} pages")
  return json.dumps(results, ensure_ascii=False, indent=2)
@@ -118,20 +126,30 @@
  )

  if result.success:
- content = result.markdown if format == "markdown" else result.cleaned_html
- all_results.append({
- "url": url,
- "depth": current_depth,
- "title": result.metadata.get("title", ""),
- "content": content[:5000], # Limit content size
- })
+ content = (
+ result.markdown
+ if format == "markdown"
+ else result.cleaned_html
+ )
+ all_results.append(
+ {
+ "url": url,
+ "depth": current_depth,
+ "title": result.metadata.get("title", ""),
+ "content": content[:5000], # Limit content size
+ }
+ )

  # Add internal links for next depth
  if current_depth < depth:
  internal_links = result.links.get("internal", [])
  for link_item in internal_links[:10]:
  # Crawl4AI returns dicts with 'href' key
- link_url = link_item.get("href", "") if isinstance(link_item, dict) else link_item
+ link_url = (
+ link_item.get("href", "")
+ if isinstance(link_item, dict)
+ else link_item
+ )
  if link_url and link_url not in visited:
  to_crawl.append((link_url, current_depth + 1))

@@ -279,18 +297,22 @@ async def download_media(

  filepath.write_bytes(response.content)

- results.append({
- "url": url,
- "path": str(filepath),
- "size": len(response.content),
- })
+ results.append(
+ {
+ "url": url,
+ "path": str(filepath),
+ "size": len(response.content),
+ }
+ )

  except Exception as e:
  logger.error(f"Error downloading {url}: {e}")
- results.append({
- "url": url,
- "error": str(e),
- })
+ results.append(
+ {
+ "url": url,
+ "error": str(e),
+ }
+ )

  logger.info(f"Downloaded {len([r for r in results if 'path' in r])} files")
  return json.dumps(results, ensure_ascii=False, indent=2)
@@ -45,12 +45,14 @@ async def search(
  # Format results
  formatted = []
  for r in results:
- formatted.append({
- "url": r.get("url", ""),
- "title": r.get("title", ""),
- "snippet": r.get("content", ""),
- "source": r.get("engine", ""),
- })
+ formatted.append(
+ {
+ "url": r.get("url", ""),
+ "title": r.get("title", ""),
+ "snippet": r.get("content", ""),
+ "source": r.get("engine", ""),
+ }
+ )

  output = {
  "results": formatted,
@@ -1,32 +0,0 @@
- """Configuration settings for WET MCP Server."""
-
- from pydantic_settings import BaseSettings
-
-
- class Settings(BaseSettings):
-     """WET MCP Server configuration."""
-
-     # SearXNG
-     searxng_url: str = "http://localhost:8080"
-     searxng_timeout: int = 30
-
-     # Crawler
-     crawler_headless: bool = True
-     crawler_timeout: int = 60
-
-     # Docker Management
-     wet_auto_docker: bool = True
-     wet_container_name: str = "wet-searxng"
-     wet_searxng_image: str = "searxng/searxng:latest"
-     wet_searxng_port: int = 8080
-
-     # Media
-     download_dir: str = "~/.wet-mcp/downloads"
-
-     # Logging
-     log_level: str = "INFO"
-
-     model_config = {"env_prefix": "", "case_sensitive": False}
-
-
- settings = Settings()
5 files without changes