PyPI - thordata-mcp-server - Versions diffs - 0.4.4__tar.gz → 0.5.0__tar.gz - Mend

thordata-mcp-server 0.4.4__tar.gz → 0.5.0__tar.gz


Files changed (32)

{thordata_mcp_server-0.4.4 → thordata_mcp_server-0.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: thordata-mcp-server
-Version: 0.4.4
+Version: 0.5.0
 Summary: Official MCP Server for Thordata.
 Author-email: Thordata Developer Team <support@thordata.com>
 License-Expression: MIT
@@ -8,7 +8,7 @@ Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 Requires-Dist: mcp[cli]>=1.0.0
 Requires-Dist: sse-starlette>=1.6.1
-Requires-Dist: thordata-sdk>=1.6.0
+Requires-Dist: thordata-sdk>=1.7.0
 Requires-Dist: pydantic-settings
 Requires-Dist: markdownify
 Requires-Dist: html2text
@@ -23,14 +23,14 @@ Requires-Dist: uvicorn
 **Give your AI Agents real-time web scraping superpowers.**
-This MCP Server version has been **streamlined to focus on scraping**, concentrating on four core products:
+This MCP Server version has been **streamlined to focus on scraping**, concentrating on a compact, LLM‑friendly tool surface:
-- **SERP API** (Search result scraping)
+- **Search Engine** (LLM-friendly web search wrapper)
+- **SERP API** (Search result scraping, internal plumbing)
 - **Web Unlocker / Universal Scraper** (Universal page unlocking & scraping)
-- **Web Scraper API** (Structured task flow)
 - **Scraping Browser** (Browser-level scraping)
-Earlier versions exposed `proxy.*` / `account.*` / `proxy_users.*` proxy and account management tools. This version has removed these control plane interfaces, keeping only scraping-related capabilities for a clean tool surface in Cursor / MCP clients.
+Earlier versions exposed `proxy.*` / `account.*` / `proxy_users.*` proxy and account management tools, and a large `web_scraper` task surface. This version removes those control plane interfaces from MCP, keeping only scraping-related capabilities that are easy for LLMs to use.
 ## 🚀 Features
@@ -76,38 +76,15 @@ THORDATA_BROWSER_PASSWORD=your_password
 ### Tool Exposure Modes
-Current implementation provides **streamlined scraping tool surface only**, no longer exposing proxy and account management tools:
+Current implementation provides a **compact scraping tool surface**, optimized for Cursor / LLM tool callers:
-- **SERP SCRAPER**: `serp` (actions: `search`, `batch_search`)
-- **WEB UNLOCKER**: `unlocker` (actions: `fetch`, `batch_fetch`)
-- **WEB SCRAPER (100+ structured tasks + task management)**: `web_scraper` (actions: `catalog`, `groups`, `run`, `batch_run`, `status`, `status_batch`, `wait`, `result`, `result_batch`, `list_tasks`, `cancel`)
-- **BROWSER SCRAPER**: `browser` (actions: `navigate`, `snapshot`)
-- **Smart (auto tool + fallback)**: `smart_scrape`
+- **`search_engine`** (recommended for LLMs): high-level web search wrapper, returns a light `results[]` array with `title/link/description`. Internally delegates to the SERP backend.
+- **`search_engine_batch`**: batch variant of `search_engine` with per-item `ok/error` results.
+- **`unlocker`**: actions `fetch`, `batch_fetch` – universal page unlock & content extraction (HTML/Markdown), with per-item error reporting for batch.
+- **`browser`**: action `snapshot` – navigate (optional `url`) and capture an ARIA-focused snapshot for interactive elements.
+- **`smart_scrape`**: auto-picks the best scraper (SERP, Web Scraper, Unlocker) for a given URL and returns a unified, LLM-friendly response.
-> Note: This version focuses on scraping functionality and no longer includes `proxy.*` / `account.*` control plane tools.
-### Web Scraper discovery (100+ tools, no extra env required)
-Use `web_scraper` with `action="catalog"` / `action="groups"` to discover tools.
-This keeps Cursor/LLMs usable while still supporting **100+ tools** under a single entrypoint.
-```env
-# Default: curated + limit 60
-THORDATA_TASKS_LIST_MODE=curated
-THORDATA_TASKS_LIST_DEFAULT_LIMIT=60
-# Which groups are included when mode=curated
-THORDATA_TASKS_GROUPS=ecommerce,social,video,search,travel,code,professional
-# Optional safety/UX: restrict which tools can actually run
-# (comma-separated prefixes or exact tool keys)
-# Example:
-# THORDATA_TASKS_ALLOWLIST=thordata.tools.video.,thordata.tools.ecommerce.Amazon.ProductByAsin
-THORDATA_TASKS_ALLOWLIST=
-```
-If you want Cursor to **never** see the full 300+ tool list, keep `THORDATA_TASKS_LIST_MODE=curated`
-and optionally set `THORDATA_TASKS_ALLOWLIST` to the small subset you actually want to support.
+Internally, the server still uses structured SERP and Web Scraper capabilities, but they are not exposed as large tool surfaces by default to avoid overwhelming LLMs.
 ### Deployment (Optional)
@@ -162,19 +139,17 @@ Add this to your `claude_desktop_config.json`:
 Notes:
 - `THORDATA_BROWSER_USERNAME` / `THORDATA_BROWSER_PASSWORD` are required for `browser.*` tools (Scraping Browser).
-## 🛠️ Available Tools
-### Available Tools (All directly related to scraping)
+## 🛠️ Available Tools (Compact Surface)
-Current MCP Server only exposes the following **5 scraping-related tools**:
+By default, the MCP server exposes a **small, LLM-friendly tool set**:
-- **`serp`**: action `search`, `batch_search`
-- **`unlocker`**: action `fetch`, `batch_fetch`
-- **`web_scraper`**: action `catalog`, `groups`, `run`, `batch_run`, `status`, `status_batch`, `wait`, `result`, `result_batch`, `list_tasks`, `cancel`
-- **`browser`**: action `navigate`, `snapshot`
-- **`smart_scrape`**: auto-pick structured tool; fallback to unlocker
+- **`search_engine`**: single-query web search (`params.q`, optional `params.num`, `params.engine`).
+- **`search_engine_batch`**: batch web search with per-item `ok/error` in `results[]`.
+- **`unlocker`**: universal scraping via `fetch` / `batch_fetch`.
+- **`browser`**: `snapshot` with optional `url`, `max_items`, and `max_chars`.
+- **`smart_scrape`**: smart router for `url` with optional preview limit parameters.
-> Proxy network related APIs can still be used via other Thordata SDKs / HTTP APIs, but are not exposed through MCP to avoid introducing complex management operations in LLMs.
+Advanced / internal tools (e.g. low-level `serp.*`, full `web_scraper.*` surfaces, proxy/account control plane) remain available via HTTP APIs and SDKs, but are not exposed directly as MCP tools to keep the surface manageable for agents and LLMs.
 ## 🏗️ Architecture
@@ -189,14 +164,14 @@ thordata_mcp/
 ├── utils.py             # Common utilities (error handling, responses)
 ├── browser_session.py   # Browser session management (Playwright)
 ├── aria_snapshot.py     # ARIA snapshot filtering
-└── tools/
-    ├── product_compact.py  # Streamlined 5-tool entry point (serp/unlocker/web_scraper/browser/smart_scrape)
-    ├── product.py          # Full product implementation for internal use (reused by compact version)
-    ├── data/               # Data plane tools (only scraping-related namespaces retained)
-    │   ├── serp.py         # serp.*
-    │   ├── universal.py    # universal.*
-    │   ├── browser.py      # browser.*
-    │   └── tasks.py        # tasks.*
+    └── tools/
+        ├── product_compact.py  # Streamlined MCP entrypoint (search_engine / unlocker / browser / smart_scrape, plus batch variants)
+        ├── product.py          # Full product implementation for internal use (reused by compact version)
+        ├── data/               # Data plane tools (only scraping-related namespaces retained)
+        │   ├── serp.py         # SERP backend integration
+        │   ├── universal.py    # Universal / Unlocker backend integration
+        │   ├── browser.py      # Browser / Playwright helpers
+        │   └── tasks.py        # Structured scraping tasks (used by smart_scrape and internal flows)
 ```
 ## 🎯 Design Principles
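
For orientation, the `search_engine` call described in the README text embedded above can be sketched as follows. This is a minimal sketch under stated assumptions, not the server's code: `build_search_arguments` and `summarize_results` are hypothetical helpers, `params.q` / `params.num` / `params.engine` are assumed to mean a nested `params` object, and the response is assumed to be a dict carrying the `results[]` array with the `title/link/description` keys the README names.

```python
from typing import Any, Dict, List, Optional


def build_search_arguments(
    q: str, num: Optional[int] = None, engine: Optional[str] = None
) -> Dict[str, Any]:
    """Build tool arguments with a nested `params` object (hypothetical helper)."""
    params: Dict[str, Any] = {"q": q}
    # Optional fields are only sent when the caller sets them.
    if num is not None:
        params["num"] = num
    if engine is not None:
        params["engine"] = engine
    return {"params": params}


def summarize_results(response: Dict[str, Any]) -> List[str]:
    """Render each result as 'title - link'; missing keys become empty strings."""
    lines: List[str] = []
    for item in response.get("results", []):
        lines.append(f"{item.get('title', '')} - {item.get('link', '')}")
    return lines


if __name__ == "__main__":
    print(build_search_arguments("mcp servers", num=5))
    sample = {"results": [{"title": "A", "link": "https://a.example", "description": "d"}]}
    print(summarize_results(sample))
```

The light `title/link/description` shape is what makes the wrapper cheaper for an LLM to consume than a full SERP payload.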

{thordata_mcp_server-0.4.4 → thordata_mcp_server-0.5.0}/README.md RENAMED

@@ -2,14 +2,14 @@
 **Give your AI Agents real-time web scraping superpowers.**
-This MCP Server version has been **streamlined to focus on scraping**, concentrating on four core products:
+This MCP Server version has been **streamlined to focus on scraping**, concentrating on a compact, LLM‑friendly tool surface:
-- **SERP API** (Search result scraping)
+- **Search Engine** (LLM-friendly web search wrapper)
+- **SERP API** (Search result scraping, internal plumbing)
 - **Web Unlocker / Universal Scraper** (Universal page unlocking & scraping)
-- **Web Scraper API** (Structured task flow)
 - **Scraping Browser** (Browser-level scraping)
-Earlier versions exposed `proxy.*` / `account.*` / `proxy_users.*` proxy and account management tools. This version has removed these control plane interfaces, keeping only scraping-related capabilities for a clean tool surface in Cursor / MCP clients.
+Earlier versions exposed `proxy.*` / `account.*` / `proxy_users.*` proxy and account management tools, and a large `web_scraper` task surface. This version removes those control plane interfaces from MCP, keeping only scraping-related capabilities that are easy for LLMs to use.
 ## 🚀 Features
@@ -55,38 +55,15 @@ THORDATA_BROWSER_PASSWORD=your_password
 ### Tool Exposure Modes
-Current implementation provides **streamlined scraping tool surface only**, no longer exposing proxy and account management tools:
+Current implementation provides a **compact scraping tool surface**, optimized for Cursor / LLM tool callers:
-- **SERP SCRAPER**: `serp` (actions: `search`, `batch_search`)
-- **WEB UNLOCKER**: `unlocker` (actions: `fetch`, `batch_fetch`)
-- **WEB SCRAPER (100+ structured tasks + task management)**: `web_scraper` (actions: `catalog`, `groups`, `run`, `batch_run`, `status`, `status_batch`, `wait`, `result`, `result_batch`, `list_tasks`, `cancel`)
-- **BROWSER SCRAPER**: `browser` (actions: `navigate`, `snapshot`)
-- **Smart (auto tool + fallback)**: `smart_scrape`
+- **`search_engine`** (recommended for LLMs): high-level web search wrapper, returns a light `results[]` array with `title/link/description`. Internally delegates to the SERP backend.
+- **`search_engine_batch`**: batch variant of `search_engine` with per-item `ok/error` results.
+- **`unlocker`**: actions `fetch`, `batch_fetch` – universal page unlock & content extraction (HTML/Markdown), with per-item error reporting for batch.
+- **`browser`**: action `snapshot` – navigate (optional `url`) and capture an ARIA-focused snapshot for interactive elements.
+- **`smart_scrape`**: auto-picks the best scraper (SERP, Web Scraper, Unlocker) for a given URL and returns a unified, LLM-friendly response.
-> Note: This version focuses on scraping functionality and no longer includes `proxy.*` / `account.*` control plane tools.
-### Web Scraper discovery (100+ tools, no extra env required)
-Use `web_scraper` with `action="catalog"` / `action="groups"` to discover tools.
-This keeps Cursor/LLMs usable while still supporting **100+ tools** under a single entrypoint.
-```env
-# Default: curated + limit 60
-THORDATA_TASKS_LIST_MODE=curated
-THORDATA_TASKS_LIST_DEFAULT_LIMIT=60
-# Which groups are included when mode=curated
-THORDATA_TASKS_GROUPS=ecommerce,social,video,search,travel,code,professional
-# Optional safety/UX: restrict which tools can actually run
-# (comma-separated prefixes or exact tool keys)
-# Example:
-# THORDATA_TASKS_ALLOWLIST=thordata.tools.video.,thordata.tools.ecommerce.Amazon.ProductByAsin
-THORDATA_TASKS_ALLOWLIST=
-```
-If you want Cursor to **never** see the full 300+ tool list, keep `THORDATA_TASKS_LIST_MODE=curated`
-and optionally set `THORDATA_TASKS_ALLOWLIST` to the small subset you actually want to support.
+Internally, the server still uses structured SERP and Web Scraper capabilities, but they are not exposed as large tool surfaces by default to avoid overwhelming LLMs.
 ### Deployment (Optional)
@@ -141,19 +118,17 @@ Add this to your `claude_desktop_config.json`:
 Notes:
 - `THORDATA_BROWSER_USERNAME` / `THORDATA_BROWSER_PASSWORD` are required for `browser.*` tools (Scraping Browser).
-## 🛠️ Available Tools
-### Available Tools (All directly related to scraping)
+## 🛠️ Available Tools (Compact Surface)
-Current MCP Server only exposes the following **5 scraping-related tools**:
+By default, the MCP server exposes a **small, LLM-friendly tool set**:
-- **`serp`**: action `search`, `batch_search`
-- **`unlocker`**: action `fetch`, `batch_fetch`
-- **`web_scraper`**: action `catalog`, `groups`, `run`, `batch_run`, `status`, `status_batch`, `wait`, `result`, `result_batch`, `list_tasks`, `cancel`
-- **`browser`**: action `navigate`, `snapshot`
-- **`smart_scrape`**: auto-pick structured tool; fallback to unlocker
+- **`search_engine`**: single-query web search (`params.q`, optional `params.num`, `params.engine`).
+- **`search_engine_batch`**: batch web search with per-item `ok/error` in `results[]`.
+- **`unlocker`**: universal scraping via `fetch` / `batch_fetch`.
+- **`browser`**: `snapshot` with optional `url`, `max_items`, and `max_chars`.
+- **`smart_scrape`**: smart router for `url` with optional preview limit parameters.
-> Proxy network related APIs can still be used via other Thordata SDKs / HTTP APIs, but are not exposed through MCP to avoid introducing complex management operations in LLMs.
+Advanced / internal tools (e.g. low-level `serp.*`, full `web_scraper.*` surfaces, proxy/account control plane) remain available via HTTP APIs and SDKs, but are not exposed directly as MCP tools to keep the surface manageable for agents and LLMs.
 ## 🏗️ Architecture
@@ -168,14 +143,14 @@ thordata_mcp/
 ├── utils.py             # Common utilities (error handling, responses)
 ├── browser_session.py   # Browser session management (Playwright)
 ├── aria_snapshot.py     # ARIA snapshot filtering
-└── tools/
-    ├── product_compact.py  # Streamlined 5-tool entry point (serp/unlocker/web_scraper/browser/smart_scrape)
-    ├── product.py          # Full product implementation for internal use (reused by compact version)
-    ├── data/               # Data plane tools (only scraping-related namespaces retained)
-    │   ├── serp.py         # serp.*
-    │   ├── universal.py    # universal.*
-    │   ├── browser.py      # browser.*
-    │   └── tasks.py        # tasks.*
+    └── tools/
+        ├── product_compact.py  # Streamlined MCP entrypoint (search_engine / unlocker / browser / smart_scrape, plus batch variants)
+        ├── product.py          # Full product implementation for internal use (reused by compact version)
+        ├── data/               # Data plane tools (only scraping-related namespaces retained)
+        │   ├── serp.py         # SERP backend integration
+        │   ├── universal.py    # Universal / Unlocker backend integration
+        │   ├── browser.py      # Browser / Playwright helpers
+        │   └── tasks.py        # Structured scraping tasks (used by smart_scrape and internal flows)
 ```
 ## 🎯 Design Principles
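
The README hunks above say that `search_engine_batch` and `unlocker` `batch_fetch` report per-item `ok/error` results. A caller might handle that shape roughly as below. This is a sketch only: `split_batch_results` is a hypothetical helper, and each item is assumed to be a dict with a boolean `ok` field and, on failure, an `error` field. The exact item schema is not shown in this diff.

```python
from typing import Any, Dict, List, Tuple

Item = Dict[str, Any]


def split_batch_results(response: Dict[str, Any]) -> Tuple[List[Item], List[Item]]:
    """Partition batch items into (succeeded, failed) by their `ok` flag."""
    succeeded: List[Item] = []
    failed: List[Item] = []
    for item in response.get("results", []):
        # A missing or falsy `ok` is treated as a failure.
        (succeeded if item.get("ok") else failed).append(item)
    return succeeded, failed


if __name__ == "__main__":
    sample = {
        "results": [
            {"ok": True, "q": "first query"},
            {"ok": False, "q": "second query", "error": "timeout"},
        ]
    }
    good, bad = split_batch_results(sample)
    print(len(good), "succeeded;", [b["error"] for b in bad])
```

Per-item reporting means one failed query does not discard the rest of the batch, so the caller can retry only the failed items.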

{thordata_mcp_server-0.4.4 → thordata_mcp_server-0.5.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "thordata-mcp-server"
-version = "0.4.4"
+version = "0.5.0"
 description = "Official MCP Server for Thordata."
 authors = [{name = "Thordata Developer Team", email = "support@thordata.com"}]
 readme = "README.md"
@@ -13,7 +13,7 @@ license = "MIT"
 dependencies = [
     "mcp[cli]>=1.0.0",
     "sse-starlette>=1.6.1",
-    "thordata-sdk>=1.6.0",
+    "thordata-sdk>=1.7.0",
     "pydantic-settings",
     "markdownify",
     "html2text",

{thordata_mcp_server-0.4.4 → thordata_mcp_server-0.5.0}/src/thordata_mcp/__init__.py RENAMED

@@ -1,4 +1,4 @@
 """
 Thordata MCP Server package.
 """
-__version__ = "0.4.4"
+__version__ = "0.5.0"

{thordata_mcp_server-0.4.4 → thordata_mcp_server-0.5.0}/src/thordata_mcp/browser_session.py RENAMED

@@ -1,8 +1,7 @@
 """Browser session management for Thordata Scraping Browser.
 This module provides a high-level wrapper around Playwright connected to
-Thordata's Scraping Browser (via `AsyncThordataClient.get_browser_connection_url`),
-inspired by Bright Data's browser session design but implemented in Python.
+Thordata's Scraping Browser (via `AsyncThordataClient.get_browser_connection_url`).
 Design goals:
 - Domain-scoped browser sessions (one browser/page per domain).
@@ -18,6 +17,8 @@ from urllib.parse import urlparse
 from playwright.async_api import Browser, Page, Playwright, async_playwright
+import time
 from thordata.async_client import AsyncThordataClient
 from .aria_snapshot import AriaSnapshotFilter
@@ -37,6 +38,11 @@ class BrowserSession:
         self._requests: Dict[str, Dict[Any, Any]] = {}
         self._dom_refs: Set[str] = set()
         self._current_domain: str = "default"
+        # Console and network diagnostics cache
+        self._console_messages: Dict[str, List[Dict[str, Any]]] = {}
+        self._network_requests: Dict[str, List[Dict[str, Any]]] = {}
+        self._max_console_messages = 10
+        self._max_network_requests = 20
     @staticmethod
     def _get_domain(url: str) -> str:
@@ -139,11 +145,26 @@ class BrowserSession:
         # Reset network tracking for this domain
         self._requests[domain] = {}
+        self._console_messages[domain] = []
+        self._network_requests[domain] = []
         async def on_request(request: Any) -> None:
             if domain in self._requests:
                 self._requests[domain][request] = None
+            try:
+                self._network_requests.setdefault(domain, [])
+                self._network_requests[domain].append(
+                    {
+                        "url": request.url,
+                        "method": request.method,
+                        "resourceType": getattr(request, "resource_type", None),
+                        "timestamp": int(time.time() * 1000),
+                    }
+                )
+                self._network_requests[domain] = self._network_requests[domain][-self._max_network_requests :]
+            except Exception:
+                pass
         async def on_response(response: Any) -> None:
             if domain in self._requests:
                 try:
@@ -151,15 +172,78 @@ class BrowserSession:
                 except Exception:
                     # Best-effort, non-fatal
                     pass
+            try:
+                # Update last matching request with status
+                req = response.request
+                url = getattr(req, "url", None)
+                if url and domain in self._network_requests:
+                    for item in reversed(self._network_requests[domain]):
+                        if item.get("url") == url and item.get("statusCode") is None:
+                            item["statusCode"] = response.status
+                            break
+            except Exception:
+                pass
         page.on("request", on_request)
         page.on("response", on_response)
+        # Console message tracking
+        async def on_console(msg: Any) -> None:
+            try:
+                self._console_messages.setdefault(domain, [])
+                self._console_messages[domain].append(
+                    {
+                        "type": msg.type,
+                        "message": msg.text,
+                        "timestamp": int(time.time() * 1000),
+                    }
+                )
+                self._console_messages[domain] = self._console_messages[domain][-self._max_console_messages :]
+            except Exception:
+                pass
+        page.on("console", on_console)
         self._pages[domain] = page
         return page
-    async def capture_snapshot(self, filtered: bool = True) -> Dict[str, Any]:
-        """Capture an ARIA-like snapshot and optional DOM snapshot."""
+    def get_console_tail(self, n: int = 10, domain: Optional[str] = None) -> List[Dict[str, Any]]:
+        """Return recent console messages for the given domain."""
+        d = domain or self._current_domain
+        items = self._console_messages.get(d, [])
+        return items[-max(0, int(n)) :]
+    def get_network_tail(self, n: int = 20, domain: Optional[str] = None) -> List[Dict[str, Any]]:
+        """Return recent network request summaries for the given domain."""
+        d = domain or self._current_domain
+        items = self._network_requests.get(d, [])
+        return items[-max(0, int(n)) :]
+    def reset_page(self, domain: Optional[str] = None) -> None:
+        """Drop cached page for a domain so the next call recreates it."""
+        d = domain or self._current_domain
+        self._pages.pop(d, None)
+        self._requests.pop(d, None)
+        self._console_messages.pop(d, None)
+        self._network_requests.pop(d, None)
+    async def capture_snapshot(
+        self,
+        *,
+        filtered: bool = True,
+        mode: str = "compact",
+        max_items: int = 80,
+        include_dom: bool = False,
+    ) -> Dict[str, Any]:
+        """Capture an ARIA-like snapshot and optional DOM snapshot.
+        Args:
+            filtered: Whether to apply AriaSnapshotFilter (legacy, kept for compatibility).
+            mode: "compact" | "full". Compact returns minimal interactive elements.
+            max_items: Maximum number of interactive elements to include (compact mode only).
+            include_dom: Whether to include dom_snapshot (compact mode defaults to False).
+        """
         page = await self.get_page()
         try:
@@ -175,16 +259,64 @@ class BrowserSession:
                 "aria_snapshot": full_snapshot,
             }
+        if mode == "compact":
+            # Compact: return only filtered interactive elements, optionally without dom_snapshot
+            filtered_snapshot = AriaSnapshotFilter.filter_snapshot(full_snapshot)
+            filtered_snapshot = self._limit_aria_snapshot_items(filtered_snapshot, max_items=max_items)
+            dom_snapshot = None
+            if include_dom:
+                dom_snapshot_raw = await self._capture_dom_snapshot(page)
+                self._dom_refs = {el["ref"] for el in dom_snapshot_raw}
+                dom_snapshot = AriaSnapshotFilter.format_dom_elements(dom_snapshot_raw)
+            return {
+                "url": page.url,
+                "title": await page.title(),
+                "aria_snapshot": filtered_snapshot,
+                "dom_snapshot": dom_snapshot,
+                "_meta": {"mode": mode, "max_items": max_items, "include_dom": include_dom},
+            }
+        # Full mode: include both filtered aria and dom_snapshot (legacy behavior)
         filtered_snapshot = AriaSnapshotFilter.filter_snapshot(full_snapshot)
-        dom_snapshot = await self._capture_dom_snapshot(page)
-        self._dom_refs = {el["ref"] for el in dom_snapshot}
+        dom_snapshot_raw = await self._capture_dom_snapshot(page)
+        self._dom_refs = {el["ref"] for el in dom_snapshot_raw}
         return {
             "url": page.url,
             "title": await page.title(),
             "aria_snapshot": filtered_snapshot,
-            "dom_snapshot": AriaSnapshotFilter.format_dom_elements(dom_snapshot),
+            "dom_snapshot": AriaSnapshotFilter.format_dom_elements(dom_snapshot_raw),
+            "_meta": {"mode": mode},
         }
+    @staticmethod
+    def _limit_aria_snapshot_items(text: str, *, max_items: int) -> str:
+        """Limit snapshot to the first N interactive element blocks.
+        The snapshot format is a list where each element starts with a line beginning
+        with '- ' (Playwright raw) or '[' (AriaSnapshotFilter compact), and may include
+        one or more indented continuation lines.
+        """
+        try:
+            n = int(max_items)
+        except Exception:
+            n = 80
+        if n <= 0:
+            return ""
+        if not text:
+            return text
+        lines = text.splitlines()
+        out: list[str] = []
+        items = 0
+        for line in lines:
+            if line.startswith("- ") or line.startswith("["):
+                if items >= n:
+                    break
+                items += 1
+            # Include continuation lines only if we've started collecting items.
+            if items > 0:
+                out.append(line)
+        return "\n".join(out).strip()
     async def _get_interactive_snapshot(self, page: Page) -> str:
         """Generate a text snapshot of interactive elements with refs."""
@@ -194,12 +326,25 @@ class BrowserSession:
                 const lines = [];
                 let refCounter = 0;
+                function normalizeRole(tag, explicitRole) {
+                    const role = (explicitRole || '').toLowerCase();
+                    const t = (tag || '').toLowerCase();
+                    if (role) return role;
+                    // Map common interactive tags to standard ARIA roles
+                    if (t === 'a') return 'link';
+                    if (t === 'button') return 'button';
+                    if (t === 'input') return 'textbox';
+                    if (t === 'select') return 'combobox';
+                    if (t === 'textarea') return 'textbox';
+                    return t;
+                }
                 function traverse(node) {
                     if (node.nodeType === Node.ELEMENT_NODE) {
-                        const role = node.getAttribute('role') || node.tagName.toLowerCase();
                         const tag = node.tagName.toLowerCase();
                         const interactiveTag = ['a', 'button', 'input', 'select', 'textarea'].includes(tag);
-                        const interactiveRole = ['button', 'link', 'textbox', 'checkbox'].includes(role);
+                        const role = normalizeRole(tag, node.getAttribute('role'));
+                        const interactiveRole = ['button', 'link', 'textbox', 'searchbox', 'combobox', 'checkbox', 'radio', 'switch', 'tab', 'menuitem', 'option'].includes(role);
                         if (interactiveTag || interactiveRole) {
                             if (!node.dataset.fastmcpRef) {
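
The browser_session.py hunks above add two bounded-output mechanisms: truncating a snapshot to the first N element blocks, and capping the console/network diagnostics lists to their newest entries. The standalone sketch below mirrors that logic outside the class so it can be run in isolation. `limit_snapshot_items` follows `_limit_aria_snapshot_items` from the diff (without its fallback for a non-integer `max_items`), while `append_capped` is a hypothetical name for the append-then-slice pattern used in `on_request` and `on_console`.

```python
from typing import Any, Dict, List


def limit_snapshot_items(text: str, max_items: int = 80) -> str:
    """Keep only the first `max_items` element blocks of a snapshot string."""
    if max_items <= 0:
        return ""
    if not text:
        return text
    out: List[str] = []
    items = 0
    for line in text.splitlines():
        # A new element block starts with '- ' or '['.
        if line.startswith("- ") or line.startswith("["):
            if items >= max_items:
                break
            items += 1
        # Lines before the first element are dropped; continuation lines are kept.
        if items > 0:
            out.append(line)
    return "\n".join(out).strip()


def append_capped(
    buffer: List[Dict[str, Any]], entry: Dict[str, Any], cap: int
) -> List[Dict[str, Any]]:
    """Append `entry`, then keep only the newest `cap` entries (assumes cap > 0)."""
    buffer.append(entry)
    return buffer[-cap:]


if __name__ == "__main__":
    snapshot = 'header\n- button "OK"\n  detail\n- link "Home"\n- textbox'
    print(limit_snapshot_items(snapshot, max_items=2))

    messages: List[Dict[str, Any]] = []
    for i in range(5):
        messages = append_capped(messages, {"i": i}, cap=3)
    print([m["i"] for m in messages])
```

One detail worth keeping in mind when reading `get_console_tail` and `get_network_tail`: in Python, `items[-0:]` is the same as `items[0:]`, so the expression `items[-max(0, int(n)) :]` returns the whole list when `n` is 0 or negative, not an empty one. With the caps of 10 and 20 the result stays small, but callers passing `n=0` should not expect an empty list.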

{thordata_mcp_server-0.4.4 → thordata_mcp_server-0.5.0}/src/thordata_mcp/config.py RENAMED

@@ -6,6 +6,14 @@ from pydantic_settings import BaseSettings
 class Settings(BaseSettings):
     """Environment-driven configuration for the MCP server."""
+    # MCP tool exposure mode (BrightData-like)
+    # - rapid: minimal core tools
+    # - pro: all tools
+    # - custom: enable by THORDATA_GROUPS and THORDATA_TOOLS
+    THORDATA_MODE: str = "rapid"
+    THORDATA_GROUPS: str | None = None
+    THORDATA_TOOLS: str | None = None
     # Thordata credentials
     THORDATA_SCRAPER_TOKEN: str | None = None
     THORDATA_PUBLIC_TOKEN: str | None = None
@@ -20,9 +28,9 @@ class Settings(BaseSettings):
     # Tasks discovery UX (to avoid dumping hundreds of tools to the client by default)
     # - mode=curated: only return tools from THORDATA_TASKS_GROUPS, with pagination
     # - mode=all: return all discovered tools
-    # Default to listing ALL Web Scraper tasks, but paginated (no env changes required for “100+ tools” use-case).
-    THORDATA_TASKS_LIST_MODE: str = "all"
-    THORDATA_TASKS_LIST_DEFAULT_LIMIT: int = 100
+    # Default to curated mode to reduce tool selection noise for LLMs.
+    THORDATA_TASKS_LIST_MODE: str = "curated"
+    THORDATA_TASKS_LIST_DEFAULT_LIMIT: int = 60
     THORDATA_TASKS_GROUPS: str = "ecommerce,social,video,search,travel,code,professional"
     # Optional: restrict which SDK tool_keys are allowed to execute (safety/UX)
@@ -49,6 +57,9 @@ class Settings(BaseSettings):
     # Logging
     LOG_LEVEL: str = "INFO"
+    # Debug tools exposure
+    THORDATA_DEBUG_TOOLS: bool = False
     class Config:
         env_file = ".env"
         extra = "ignore"
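
To make the new defaults in config.py concrete, the sketch below reads the same environment variables with the standard library only. It is a stand-in for illustration, not the package's `Settings` class: `ExposureSettings` and `load_exposure_settings` are hypothetical names, and the boolean parsing is a simplification of what pydantic-settings accepts. How the server consumes `THORDATA_MODE`, `THORDATA_GROUPS`, and `THORDATA_TOOLS` is not shown in this diff.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExposureSettings:
    """Stdlib stand-in for the Settings fields touched by this diff."""

    mode: str = "rapid"
    groups: Optional[str] = None
    tools: Optional[str] = None
    tasks_list_mode: str = "curated"
    tasks_list_default_limit: int = 60
    debug_tools: bool = False


def load_exposure_settings(env: Mapping[str, str] = os.environ) -> ExposureSettings:
    """Read the variables from `env`, falling back to the 0.5.0 defaults."""
    truthy = {"1", "true", "yes", "on"}  # simplified boolean parsing
    return ExposureSettings(
        mode=env.get("THORDATA_MODE", "rapid"),
        groups=env.get("THORDATA_GROUPS"),
        tools=env.get("THORDATA_TOOLS"),
        tasks_list_mode=env.get("THORDATA_TASKS_LIST_MODE", "curated"),
        tasks_list_default_limit=int(env.get("THORDATA_TASKS_LIST_DEFAULT_LIMIT", "60")),
        debug_tools=env.get("THORDATA_DEBUG_TOOLS", "false").strip().lower() in truthy,
    )


if __name__ == "__main__":
    print(load_exposure_settings({}))
    legacy = {"THORDATA_TASKS_LIST_MODE": "all", "THORDATA_TASKS_LIST_DEFAULT_LIMIT": "100"}
    print(load_exposure_settings(legacy))
```

Setting `THORDATA_TASKS_LIST_MODE=all` and `THORDATA_TASKS_LIST_DEFAULT_LIMIT=100` restores the task-listing defaults that 0.4.4 shipped with.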

{thordata_mcp_server-0.4.4 → thordata_mcp_server-0.5.0}/src/thordata_mcp/context.py RENAMED

@@ -35,4 +35,4 @@ class ServerContext:
         if cls._client:
             await cls._client.close()
-            cls._client = None
+            cls._client = None
