PyPI - structx-mcp - Versions diffs - 0.1.0__tar.gz - Mend

structx-mcp 0.1.0__tar.gz

Files changed (12)

structx_mcp-0.1.0/.gitignore +12 -0
structx_mcp-0.1.0/CHANGELOG.md +39 -0
structx_mcp-0.1.0/LICENSE +21 -0
structx_mcp-0.1.0/PKG-INFO +140 -0
structx_mcp-0.1.0/README.md +112 -0
structx_mcp-0.1.0/pyproject.toml +72 -0
structx_mcp-0.1.0/src/structx_mcp/__init__.py +24 -0
structx_mcp-0.1.0/src/structx_mcp/__main__.py +18 -0
structx_mcp-0.1.0/src/structx_mcp/_tools.py +228 -0
structx_mcp-0.1.0/src/structx_mcp/_url_safety.py +127 -0
structx_mcp-0.1.0/src/structx_mcp/_version.py +1 -0
structx_mcp-0.1.0/src/structx_mcp/server.py +214 -0

structx_mcp-0.1.0/.gitignore ADDED

@@ -0,0 +1,12 @@
+dist/
+build/
+*.egg-info/
+.venv/
+venv/
+__pycache__/
+*.pyc
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.vscode/
+.idea/

structx_mcp-0.1.0/CHANGELOG.md ADDED

@@ -0,0 +1,39 @@
+# Changelog
+All notable changes to the `structx-mcp` package are documented here.
+This project follows [Semantic Versioning](https://semver.org/).
+## [Unreleased]
+### Changed
+- Internal: bumped the `structx` dependency to `structx-sdk>=0.2.0` (SDK package rename). No customer-facing change — the MCP package name, CLI command, and configuration are unchanged.
+## [0.1.0] — 2026-05-24
+Initial release. Replaces the in-tree `backend/mcp/server.py` (which
+used raw httpx) with a standalone PyPI package backed by the official
+`structx` Python SDK.
+### Added
+- 5 MCP tools designed for agent-first use:
+  - `structx_extract` — extract using inline schema OR `template_slug`.
+  - `structx_infer_schema` — generate a schema from sample content.
+  - `structx_extract_from_url` — fetch + extract in one call, SSRF-defended.
+  - `structx_list_templates` — browse the public template catalog.
+  - `structx_usage` — current credit usage.
+- `structx-mcp` console-script entrypoint for Claude Desktop / Cursor / Windsurf configs.
+- SSRF defense for URL fetching (HTTPS only, IP classification, hostname blocklist, no redirects, 10MB cap).
+- Typed-error pass-through — SDK errors surface to the agent as JSON `{"error", "type"}` rather than vanishing into transport.
+### Compatible with
+- Python 3.10+.
+- `structx>=0.1.0,<1.0.0` (the Python SDK).
+- MCP protocol via the `mcp>=1.0.0` reference implementation.
+- struct-x API v1.x.
+### Migration note (from the in-tree server)
+The in-tree `backend/mcp/server.py` had hardcoded convenience tools
+(`structx_extract_product`, `_article`, `_contact`). Those are dropped
+in favor of `structx_extract(template_slug="ecommerce.product")` etc.
+— new templates ship without code changes, and the tool catalog stays
+lean (agents handle small tool surfaces better).
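
Reviewer sketch of the migration note above, under stated assumptions: the envelope follows the MCP `tools/call` JSON-RPC request shape, `build_tools_call` is a hypothetical helper, and the `content` value and request `id` are placeholders. Only the `ecommerce.product` slug appears in the changelog; the slugs that replace the article and contact tools are not stated there, so they are not shown.

```python
import json


def build_tools_call(name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build an MCP `tools/call` JSON-RPC request (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Old in-tree server: one dedicated tool per format, e.g. `structx_extract_product`.
# New package: a single tool, with the format selected through `template_slug`.
request = build_tools_call(
    "structx_extract",
    {"content": "<html>...</html>", "template_slug": "ecommerce.product"},
)
print(json.dumps(request, indent=2))
```

In practice the MCP client builds this request itself; the sketch only shows which tool name and argument replace the removed convenience tool.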

structx_mcp-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 struct-x
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

structx_mcp-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,140 @@
+Metadata-Version: 2.4
+Name: structx-mcp
+Version: 0.1.0
+Summary: Model Context Protocol server for struct-x — install in Claude Desktop, Cursor, Windsurf, or any MCP-compatible client to give your agent structured-extraction superpowers.
+Project-URL: Homepage, https://structx.ai
+Project-URL: Documentation, https://docs.structx.ai/mcp
+Project-URL: Repository, https://github.com/struct-x-ai/struct-x
+Project-URL: Issues, https://github.com/struct-x-ai/struct-x/issues
+Author-email: struct-x <support@structx.ai>
+License: MIT
+License-File: LICENSE
+Keywords: agent,claude,cursor,llm,mcp,model-context-protocol,structured-extraction,structx
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Topic :: Software Development :: Libraries
+Requires-Python: >=3.10
+Requires-Dist: httpx<0.29,>=0.27
+Requires-Dist: mcp>=1.0.0
+Requires-Dist: structx-sdk<1.0.0,>=0.2.0
+Provides-Extra: dev
+Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: respx>=0.21; extra == 'dev'
+Description-Content-Type: text/markdown
+# structx-mcp
+[![PyPI](https://img.shields.io/pypi/v/structx-mcp.svg)](https://pypi.org/project/structx-mcp/)
+[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
+Model Context Protocol (MCP) server for **[struct-x](https://structx.ai)** — install in Claude Desktop, Cursor, Windsurf, or any MCP-compatible client and give your agent first-class structured-extraction tools.
+```
+You: extract the product details from this Aeron chair page →
+     [pastes URL or HTML]
+Agent: [calls structx_extract under the hood, returns typed JSON]
+     {
+       "title": "Aeron Chair",
+       "price_cents": 179500,
+       "brand": "Herman Miller",
+       …
+     }
+```
+No HTTP plumbing, no schema-writing detours — the agent just *uses* extraction.
+## Install
+```bash
+pip install structx-mcp
+```
+Then add to your MCP client's config.
+### Claude Desktop
+Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
+```json
+{
+  "mcpServers": {
+    "structx": {
+      "command": "structx-mcp",
+      "env": {
+        "STRUCTX_API_KEY": "sx_..."
+      }
+    }
+  }
+}
+```
+Restart Claude Desktop. Confirm by typing `/mcp` and verifying `structx` is listed.
+### Cursor
+`.cursor/mcp.json` (project-scope) or `~/.cursor/mcp.json` (user-scope):
+```json
+{
+  "mcpServers": {
+    "structx": {
+      "command": "structx-mcp",
+      "env": { "STRUCTX_API_KEY": "sx_..." }
+    }
+  }
+}
+```
+### Windsurf / any other MCP client
+Same pattern — `command: "structx-mcp"`, `env: { STRUCTX_API_KEY: ... }`. The server speaks the standard MCP stdio protocol.
+## Tools the agent gets
+| Tool | What it does |
+|---|---|
+| **`structx_extract`** | Extract structured JSON from raw content (HTML / markdown / text / JSON / emails) using a JSON Schema you provide OR a pre-built template slug. Returns per-field confidence scores. |
+| **`structx_infer_schema`** | Don't have a schema yet? Paste raw content, get a draft schema back with per-field rationales. Plus matching templates from the public catalog. |
+| **`structx_extract_from_url`** | Fetch a URL and extract in one call. Sandboxed (HTTPS only, public DNS only, 10MB cap, no redirects). Replaces the two-call `fetch_url → extract` pattern. |
+| **`structx_list_templates`** | Browse every public template — saves the agent from hand-writing a schema for common formats (Stripe events, GitHub issues, product pages, news articles). |
+| **`structx_usage`** | Current credit usage for the API key. Pre-flight check before batches, or post-mortem after a `RateLimitError`. |
+## Why use an MCP server vs. the SDK directly?
+The Python SDK ([`structx-sdk`](https://pypi.org/project/structx-sdk/)) is for building **applications**. This MCP package is for putting struct-x **inside the AI tools you already use** — Claude Desktop, Cursor, Windsurf. The agent calls these tools natively, no app code required.
+Both are first-party; they share the same backend, same credit pool, same API key.
+## Security
+`structx_extract_from_url` is sandboxed at the network layer:
+- HTTPS only (no `http://`, `file://`, `ftp://`, etc.)
+- Hostname blocklist for cloud metadata targets (`169.254.169.254`, `metadata.google.internal`, etc.)
+- Hostname suffix blocklist for internal-network conventions (`.internal`, `.local`, `.cluster.local`, `.consul`)
+- DNS resolution + IP classification — refuses any private, loopback, link-local, multicast, or reserved IP
+- No redirect following (one less SSRF vector)
+- 10MB response cap
+This matches the defenses on the struct-x backend's webhook outbound-URL validator. See [`src/structx_mcp/_url_safety.py`](src/structx_mcp/_url_safety.py) for the implementation.
+## Development
+```bash
+git clone https://github.com/struct-x-ai/struct-x
+cd struct-x/mcp/structx-mcp
+# Install editable + the SDK from the local checkout (the SDK is in
+# the same monorepo at sdk/python/).
+pip install -e "../../sdk/python" -e ".[dev]"
+pytest -q
+```
+## License
+MIT — see [LICENSE](LICENSE).

structx_mcp-0.1.0/README.md ADDED

@@ -0,0 +1,112 @@
+# structx-mcp
+[![PyPI](https://img.shields.io/pypi/v/structx-mcp.svg)](https://pypi.org/project/structx-mcp/)
+[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
+Model Context Protocol (MCP) server for **[struct-x](https://structx.ai)** — install in Claude Desktop, Cursor, Windsurf, or any MCP-compatible client and give your agent first-class structured-extraction tools.
+```
+You: extract the product details from this Aeron chair page →
+     [pastes URL or HTML]
+Agent: [calls structx_extract under the hood, returns typed JSON]
+     {
+       "title": "Aeron Chair",
+       "price_cents": 179500,
+       "brand": "Herman Miller",
+       …
+     }
+```
+No HTTP plumbing, no schema-writing detours — the agent just *uses* extraction.
+## Install
+```bash
+pip install structx-mcp
+```
+Then add to your MCP client's config.
+### Claude Desktop
+Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
+```json
+{
+  "mcpServers": {
+    "structx": {
+      "command": "structx-mcp",
+      "env": {
+        "STRUCTX_API_KEY": "sx_..."
+      }
+    }
+  }
+}
+```
+Restart Claude Desktop. Confirm by typing `/mcp` and verifying `structx` is listed.
+### Cursor
+`.cursor/mcp.json` (project-scope) or `~/.cursor/mcp.json` (user-scope):
+```json
+{
+  "mcpServers": {
+    "structx": {
+      "command": "structx-mcp",
+      "env": { "STRUCTX_API_KEY": "sx_..." }
+    }
+  }
+}
+```
+### Windsurf / any other MCP client
+Same pattern — `command: "structx-mcp"`, `env: { STRUCTX_API_KEY: ... }`. The server speaks the standard MCP stdio protocol.
+## Tools the agent gets
+| Tool | What it does |
+|---|---|
+| **`structx_extract`** | Extract structured JSON from raw content (HTML / markdown / text / JSON / emails) using a JSON Schema you provide OR a pre-built template slug. Returns per-field confidence scores. |
+| **`structx_infer_schema`** | Don't have a schema yet? Paste raw content, get a draft schema back with per-field rationales. Plus matching templates from the public catalog. |
+| **`structx_extract_from_url`** | Fetch a URL and extract in one call. Sandboxed (HTTPS only, public DNS only, 10MB cap, no redirects). Replaces the two-call `fetch_url → extract` pattern. |
+| **`structx_list_templates`** | Browse every public template — saves the agent from hand-writing a schema for common formats (Stripe events, GitHub issues, product pages, news articles). |
+| **`structx_usage`** | Current credit usage for the API key. Pre-flight check before batches, or post-mortem after a `RateLimitError`. |
+## Why use an MCP server vs. the SDK directly?
+The Python SDK ([`structx-sdk`](https://pypi.org/project/structx-sdk/)) is for building **applications**. This MCP package is for putting struct-x **inside the AI tools you already use** — Claude Desktop, Cursor, Windsurf. The agent calls these tools natively, no app code required.
+Both are first-party; they share the same backend, same credit pool, same API key.
+## Security
+`structx_extract_from_url` is sandboxed at the network layer:
+- HTTPS only (no `http://`, `file://`, `ftp://`, etc.)
+- Hostname blocklist for cloud metadata targets (`169.254.169.254`, `metadata.google.internal`, etc.)
+- Hostname suffix blocklist for internal-network conventions (`.internal`, `.local`, `.cluster.local`, `.consul`)
+- DNS resolution + IP classification — refuses any private, loopback, link-local, multicast, or reserved IP
+- No redirect following (one less SSRF vector)
+- 10MB response cap
+This matches the defenses on the struct-x backend's webhook outbound-URL validator. See [`src/structx_mcp/_url_safety.py`](src/structx_mcp/_url_safety.py) for the implementation.
+## Development
+```bash
+git clone https://github.com/struct-x-ai/struct-x
+cd struct-x/mcp/structx-mcp
+# Install editable + the SDK from the local checkout (the SDK is in
+# the same monorepo at sdk/python/).
+pip install -e "../../sdk/python" -e ".[dev]"
+pytest -q
+```
+## License
+MIT — see [LICENSE](LICENSE).
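
Reviewer sketch for the client-config steps in the README above: a minimal way to add the `structx` entry to an existing MCP client config without overwriting other servers. `add_structx_server` and the `other` server entry are hypothetical; the `command` and `env` keys are taken from the README's JSON examples.

```python
import copy
import json


def add_structx_server(config: dict, api_key: str) -> dict:
    """Return a copy of `config` with a `structx` entry under `mcpServers`.

    Hypothetical helper: other configured servers are left untouched and
    the input dict is not mutated.
    """
    merged = copy.deepcopy(config)
    servers = merged.setdefault("mcpServers", {})
    servers["structx"] = {
        "command": "structx-mcp",
        "env": {"STRUCTX_API_KEY": api_key},
    }
    return merged


# Placeholder existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
updated = add_structx_server(existing, "sx_...")
print(json.dumps(updated, indent=2))
```

Writing the result back to `claude_desktop_config.json` or `.cursor/mcp.json` is left out on purpose, since the file location differs per client and platform.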

structx_mcp-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,72 @@
+[build-system]
+requires = ["hatchling>=1.18"]
+build-backend = "hatchling.build"
+[project]
+name = "structx-mcp"
+version = "0.1.0"
+description = "Model Context Protocol server for struct-x — install in Claude Desktop, Cursor, Windsurf, or any MCP-compatible client to give your agent structured-extraction superpowers."
+readme = "README.md"
+requires-python = ">=3.10"
+license = { text = "MIT" }
+authors = [
+    { name = "struct-x", email = "support@structx.ai" },
+]
+keywords = [
+    "mcp",
+    "model-context-protocol",
+    "structured-extraction",
+    "llm",
+    "agent",
+    "claude",
+    "cursor",
+    "structx",
+]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3 :: Only",
+    "Topic :: Software Development :: Libraries",
+]
+dependencies = [
+    "structx-sdk>=0.2.0,<1.0.0",
+    "mcp>=1.0.0",
+    "httpx>=0.27,<0.29",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0",
+    "pytest-asyncio>=0.23",
+    "respx>=0.21",
+]
+[project.scripts]
+# `structx-mcp` becomes a CLI command on install. Claude Desktop /
+# Cursor / Windsurf can reference it directly in their MCP config:
+#   { "mcpServers": { "structx": { "command": "structx-mcp", ... } } }
+structx-mcp = "structx_mcp.__main__:main"
+[project.urls]
+Homepage = "https://structx.ai"
+Documentation = "https://docs.structx.ai/mcp"
+Repository = "https://github.com/struct-x-ai/struct-x"
+Issues = "https://github.com/struct-x-ai/struct-x/issues"
+[tool.hatch.build.targets.wheel]
+packages = ["src/structx_mcp"]
+[tool.hatch.build.targets.sdist]
+include = [
+    "src/structx_mcp",
+    "README.md",
+    "CHANGELOG.md",
+    "LICENSE",
+    "pyproject.toml",
+]
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+asyncio_mode = "auto"

structx_mcp-0.1.0/src/structx_mcp/__init__.py ADDED

@@ -0,0 +1,24 @@
+"""structx-mcp — Model Context Protocol server for struct-x.
+Install in Claude Desktop / Cursor / Windsurf or any MCP-compatible
+client and give your agent first-class structured-extraction tools.
+Quickstart (Claude Desktop):
+    1. pip install structx-mcp
+    2. Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
+        {
+          "mcpServers": {
+            "structx": {
+              "command": "structx-mcp",
+              "env": { "STRUCTX_API_KEY": "sx_..." }
+            }
+          }
+        }
+    3. Restart Claude Desktop.
+"""
+from ._version import __version__
+__all__ = ["__version__"]

structx_mcp-0.1.0/src/structx_mcp/__main__.py ADDED

@@ -0,0 +1,18 @@
+"""Entry point invoked by the `structx-mcp` console script (defined in
+pyproject.toml's [project.scripts]). Also runnable as `python -m
+structx_mcp`.
+"""
+from __future__ import annotations
+import asyncio
+from .server import _check_api_key, serve
+def main() -> None:
+    _check_api_key()
+    asyncio.run(serve())
+if __name__ == "__main__":
+    main()

structx_mcp-0.1.0/src/structx_mcp/_tools.py ADDED

@@ -0,0 +1,228 @@
+"""Tool definitions for the structx MCP server.
+Tool DESCRIPTIONS are the public API for any MCP — an LLM agent reads
+them to decide whether/when to invoke a tool. They need to be:
+  - SPECIFIC about what kind of input the tool eats (HTML? URL? text?)
+  - CLEAR about the output shape (typed JSON? confidence scores?)
+  - HONEST about edge cases the agent should know (5KB hard cap, etc.)
+  - AGENT-FIRST in their phrasing — "use this for X, Y, Z" beats
+    "this function does X"
+Kept separate from the server runtime so the descriptions are easy
+to review independently of the wiring code.
+"""
+from __future__ import annotations
+from mcp.types import Tool
+# ── Tool 1 — the workhorse ──────────────────────────────────
+EXTRACT_TOOL = Tool(
+    name="structx_extract",
+    description=(
+        "Extract structured JSON from raw content (HTML, markdown, plain text, "
+        "JSON-API responses, emails, scraped web pages, documents) using a JSON "
+        "Schema you provide OR a pre-built template slug. Returns typed data with "
+        "per-field confidence scores so you know which fields to trust. "
+        "Use this whenever you need to turn unstructured/semi-structured text into "
+        "reliable JSON. Prefer the `template_slug` argument for common formats "
+        "(Stripe events, GitHub issues, product pages, news articles) — see "
+        "`structx_list_templates` to discover available ones. "
+        "Content cap: 500KB. Calls bill against your API key's credit quota."
+    ),
+    inputSchema={
+        "type": "object",
+        "properties": {
+            "content": {
+                "type": "string",
+                "description": "Raw text/HTML/markdown to extract from. 500KB max.",
+            },
+            "schema": {
+                "type": "object",
+                "description": (
+                    "Inline JSON Schema describing the desired output shape. "
+                    "Use this OR `template_slug`, not both. Required: `type` "
+                    "field at root; `type` must be 'object' or 'array'."
+                ),
+            },
+            "template_slug": {
+                "type": "string",
+                "description": (
+                    "Catalog template identifier. Bare slug (e.g. 'logs.stripe.event') "
+                    "uses the latest published version; pin to a specific version with "
+                    "'family.name@1.0.0'. Mutually exclusive with `schema`."
+                ),
+            },
+            "tier": {
+                "type": "string",
+                "enum": ["required", "recommended", "optional", "extended"],
+                "default": "required",
+                "description": (
+                    "Field-depth dial when using a template. 'required' returns only "
+                    "the headline fields (cheapest). 'extended' returns every field "
+                    "in the template. Ignored for inline-schema extractions."
+                ),
+            },
+            "options": {
+                "type": "object",
+                "description": (
+                    "Optional extraction overrides. Common keys: `include_citations` "
+                    "(boolean) to add source snippets per field; `use_cache` (boolean, "
+                    "default true) to allow cached responses; `confidence_threshold` "
+                    "(float 0-1) to flag low-confidence results."
+                ),
+            },
+        },
+        "required": ["content"],
+    },
+)
+# ── Tool 2 — the magic moment ──────────────────────────────
+INFER_SCHEMA_TOOL = Tool(
+    name="structx_infer_schema",
+    description=(
+        "Don't have a schema yet? Paste raw content and get a draft JSON Schema "
+        "back, with per-field rationales explaining why each field was inferred. "
+        "Optionally returns matching templates from the public catalog. "
+        "Use this when starting fresh with unfamiliar content (a new SaaS webhook, "
+        "an unknown API response, an unstructured document type) — saves the agent "
+        "from hand-writing a schema. The returned schema can be passed directly "
+        "back to `structx_extract`. Cost: 3-5 credits per call (higher than "
+        "extraction because of the meta-reasoning involved)."
+    ),
+    inputSchema={
+        "type": "object",
+        "properties": {
+            "content": {
+                "type": "string",
+                "description": (
+                    "Sample content the schema should describe. 200 chars min, "
+                    "100KB max. One representative sample is plenty — don't paste "
+                    "a giant corpus."
+                ),
+            },
+            "content_type": {
+                "type": "string",
+                "enum": ["html", "json", "markdown", "text"],
+                "description": (
+                    "Hint about the format. Optional — the backend can guess from "
+                    "content, but a hint improves inference quality."
+                ),
+            },
+            "hints": {
+                "type": "object",
+                "description": (
+                    "Optional constraints. Keys: `vertical` (e.g. 'ecommerce', "
+                    "'finance', 'logs'), `fields_must_include` (array of field names "
+                    "that MUST appear in the inferred schema)."
+                ),
+            },
+            "return_recommendations": {
+                "type": "boolean",
+                "default": True,
+                "description": (
+                    "If true, also returns matching templates from the public catalog "
+                    "ranked by similarity. Useful for the agent to decide between "
+                    "'use this template' and 'use my freshly-inferred schema'."
+                ),
+            },
+        },
+        "required": ["content"],
+    },
+)
+# ── Tool 3 — the integration smoothie ──────────────────────
+EXTRACT_FROM_URL_TOOL = Tool(
+    name="structx_extract_from_url",
+    description=(
+        "Fetch a URL and extract structured JSON from it in one call. Removes "
+        "the two-call pattern (fetch_url → structx_extract) that every other agent "
+        "workflow ends up writing. Honors the same `schema` / `template_slug` / "
+        "`tier` / `options` arguments as `structx_extract`. "
+        "URL fetching is sandboxed: HTTPS only, public DNS only (no metadata "
+        "endpoints, no link-local, no private ranges, no DNS rebinds), 10MB "
+        "response cap, no redirects followed. Use for live-page extraction "
+        "(product pages, news articles, RSS items, public docs)."
+    ),
+    inputSchema={
+        "type": "object",
+        "properties": {
+            "url": {
+                "type": "string",
+                "description": "Public HTTPS URL to fetch. http://, file://, ftp://, etc. rejected.",
+            },
+            "schema": {
+                "type": "object",
+                "description": "Inline JSON Schema. Mutually exclusive with `template_slug`.",
+            },
+            "template_slug": {
+                "type": "string",
+                "description": "Catalog template identifier. Mutually exclusive with `schema`.",
+            },
+            "tier": {
+                "type": "string",
+                "enum": ["required", "recommended", "optional", "extended"],
+                "default": "required",
+            },
+            "options": {
+                "type": "object",
+                "description": "Forwarded to /v1/extract — see `structx_extract`.",
+            },
+        },
+        "required": ["url"],
+    },
+)
+# ── Tool 4 — discovery ──────────────────────────────────────
+LIST_TEMPLATES_TOOL = Tool(
+    name="structx_list_templates",
+    description=(
+        "List every public template in the struct-x catalog. Each template has a "
+        "slug (e.g. 'logs.stripe.event'), a name, a description, an optional "
+        "category, and the underlying JSON Schema. Use this to discover what "
+        "schemas already exist BEFORE writing your own — chances are someone "
+        "(possibly the platform team, possibly another user via promotion) has "
+        "already published a template for the format you care about. "
+        "Pair with `structx_extract(template_slug=...)` to skip schema-writing entirely."
+    ),
+    inputSchema={"type": "object", "properties": {}},
+)
+# ── Tool 5 — observability ──────────────────────────────────
+USAGE_TOOL = Tool(
+    name="structx_usage",
+    description=(
+        "Return the current credit usage for the API key in use — credits used "
+        "today, daily limit, current tier. Useful as a pre-flight check before a "
+        "batch operation, OR after a `RateLimitError` to confirm the cap is "
+        "actually exhausted vs. a transient throttle."
+    ),
+    inputSchema={"type": "object", "properties": {}},
+)
+# Ordered list for `list_tools()` — sort by likely usage frequency
+# (extract > infer > list > url > usage). Agents tend to surface the
+# first few tools more readily; put the workhorses up top.
+ALL_TOOLS: list[Tool] = [
+    EXTRACT_TOOL,
+    INFER_SCHEMA_TOOL,
+    LIST_TEMPLATES_TOOL,
+    EXTRACT_FROM_URL_TOOL,
+    USAGE_TOOL,
+]
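
Reviewer sketch for the `structx_extract` input schema above. The `inputSchema` marks only `content` as required, while the rule that `schema` and `template_slug` are mutually exclusive appears only in the field descriptions. A caller could pre-check arguments along these lines. `check_extract_args` is a hypothetical helper; whether the server or backend applies the same checks is not visible in this diff.

```python
ALLOWED_TIERS = ("required", "recommended", "optional", "extended")


def check_extract_args(args: dict) -> list[str]:
    """Return a list of problems found in `structx_extract` arguments.

    Hypothetical pre-flight check based only on the tool's published
    input schema and descriptions; an empty list means no problem found.
    """
    problems: list[str] = []
    if not isinstance(args.get("content"), str):
        problems.append("`content` is required and must be a string")
    if "schema" in args and "template_slug" in args:
        problems.append("`schema` and `template_slug` are mutually exclusive")
    if args.get("tier", "required") not in ALLOWED_TIERS:
        problems.append("`tier` must be one of: " + ", ".join(ALLOWED_TIERS))
    return problems


ok_args = {"content": "<html>...</html>", "template_slug": "logs.stripe.event"}
bad_args = {
    "content": "<html>...</html>",
    "schema": {"type": "object"},
    "template_slug": "logs.stripe.event",
}
print(check_extract_args(ok_args))
print(check_extract_args(bad_args))
```

The sketch deliberately does not reject a call that supplies neither `schema` nor `template_slug`, because the diff does not say how that case is handled.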

structx_mcp-0.1.0/src/structx_mcp/_url_safety.py ADDED

@@ -0,0 +1,127 @@
+"""SSRF defense for `structx_extract_from_url`.
+Agents can be tricked into fetching internal-network URLs via prompt
+injection — "please summarize http://169.254.169.254/latest/meta-data/"
+extracts AWS instance metadata. This module validates URLs before
+`structx_extract_from_url` actually fetches anything.
+Pattern mirrored from backend/services/webhooks.py (PR #173). Kept
+inline here so the MCP package doesn't need to import backend code.
+Defense layers:
+  1. Scheme allowlist (https only)
+  2. Hostname blocklist (cloud-metadata + .internal/.local)
+  3. DNS resolution → IP classification (rejects private, loopback,
+     link-local, multicast, reserved, ipv6-link-local)
+  4. Response size cap (10MB) — enforced at the caller, but documented
+     here as part of the threat model.
+Not in this module's scope:
+  - Redirect following (caller pins follow_redirects=False)
+  - Content-Type checks (caller's responsibility)
+  - Authentication header stripping (no auth headers sent in the first
+    place — caller uses a vanilla httpx call with no creds)
+"""
+from __future__ import annotations
+import ipaddress
+import socket
+from urllib.parse import urlparse
+class UrlSafetyError(ValueError):
+    """Raised when a URL fails any SSRF check."""
+# Hostnames that resolve to internal services on common cloud providers.
+# These should NEVER be reachable from a public-facing fetch.
+_BLOCKED_HOSTNAMES = frozenset(
+    {
+        "metadata.google.internal",
+        "metadata",
+        "instance-data",
+        "instance-data.ec2.internal",
+        "169.254.169.254",  # AWS/Azure/GCP metadata IP
+        "100.100.100.200",  # Alibaba Cloud metadata IP
+        "fd00:ec2::254",  # AWS IPv6 metadata
+    }
+)
+_BLOCKED_HOSTNAME_SUFFIXES = (
+    ".internal",
+    ".local",
+    ".cluster.local",
+    ".consul",
+    ".lan",
+)
+def validate_fetch_url(url: str) -> tuple[str, str]:
+    """Validate `url` is safe to fetch. Returns (normalized_url, hostname).
+    Raises UrlSafetyError on any rule break."""
+    if not url:
+        raise UrlSafetyError("URL is empty")
+    parsed = urlparse(url)
+    # Scheme allowlist — https only. Blocks http (cleartext + downgrade
+    # risk), file:// (local file read), gopher:// (smuggling), data://
+    # (no fetch needed), and any other esoteric scheme.
+    if parsed.scheme != "https":
+        raise UrlSafetyError(
+            f"Only https URLs are allowed (got scheme={parsed.scheme!r})"
+        )
+    host = (parsed.hostname or "").lower()
+    if not host:
+        raise UrlSafetyError("URL has no hostname")
+    # Port allowlist (Phase 5.5 / chi). The scheme check above only
+    # rejects non-https; an attacker can still target arbitrary TCP
+    # ports on a public host via https://public-relay.example:25/ etc.
+    # That enables (a) using the agent as a network-position scanner
+    # of memcached/redis/elasticsearch via timing oracles, (b) reaching
+    # internal services via a public hostname that resolves to a
+    # public IP fronting an internal proxy. Allowlist: standard HTTPS
+    # only. Accept an unset port or 443; reject everything else.
+    if parsed.port is not None and parsed.port != 443:
+        raise UrlSafetyError(
+            f"Only standard HTTPS port (443) is allowed (got port={parsed.port})"
+        )
+    if host in _BLOCKED_HOSTNAMES:
+        raise UrlSafetyError(f"Hostname {host!r} is blocked (cloud-metadata target)")
+    if any(host.endswith(s) for s in _BLOCKED_HOSTNAME_SUFFIXES):
+        raise UrlSafetyError(
+            f"Hostname {host!r} matches a blocked internal-network suffix"
+        )
+    # Resolve DNS and reject any IP that classifies as non-public.
+    # This catches public-looking hostnames that resolve to private,
+    # loopback, link-local, or unique-local addresses at check time.
+    try:
+        addrinfo = socket.getaddrinfo(host, None)
+    except socket.gaierror as e:
+        raise UrlSafetyError(f"DNS resolution failed for {host!r}: {e}") from e
+    seen: set[str] = set()
+    for entry in addrinfo:
+        ip_str = entry[4][0]
+        if ip_str in seen:
+            continue
+        seen.add(ip_str)
+        ip = ipaddress.ip_address(ip_str)
+        if (
+            ip.is_private
+            or ip.is_loopback
+            or ip.is_link_local
+            or ip.is_multicast
+            or ip.is_reserved
+            or ip.is_unspecified
+        ):
+            raise UrlSafetyError(
+                f"Hostname {host!r} resolves to non-public IP {ip_str} — refusing fetch"
+            )
+    return url, host
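
Reviewer sketch of the IP-classification layer in `validate_fetch_url` above, using only the standard-library `ipaddress` module. It applies the same six flags as the diff but skips the DNS lookup, so it runs offline. `is_public_ip` is a hypothetical name, not a function in the package.

```python
import ipaddress


def is_public_ip(ip_str: str) -> bool:
    """Return True only if the address trips none of the six flags
    checked in `validate_fetch_url` (hypothetical helper)."""
    ip = ipaddress.ip_address(ip_str)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


# Sample addresses: cloud metadata, RFC 1918, loopback (v4 and v6), public DNS.
for candidate in ("169.254.169.254", "10.0.0.5", "127.0.0.1", "::1", "8.8.8.8"):
    print(candidate, is_public_ip(candidate))
```

Only the last address in the loop is classified as public. In the package the check runs over every address returned by `socket.getaddrinfo`, so a hostname is rejected if any one of its resolved addresses fails.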

structx_mcp-0.1.0/src/structx_mcp/_version.py ADDED

@@ -0,0 +1 @@
+__version__ = "0.1.0"

structx_mcp-0.1.0/src/structx_mcp/server.py ADDED

@@ -0,0 +1,214 @@
+"""MCP server runtime — wires the tool definitions in `_tools.py` to
+the `structx-sdk` SDK from PyPI.
+Designed for stdio transport (the Claude Desktop / Cursor / Windsurf
+default). HTTP transport would be a follow-up — different deployment
+model entirely.
+Layering:
+  - _tools.py defines TOOL METADATA (descriptions, input schemas).
+  - _url_safety.py validates URLs before fetch.
+  - server.py is the runtime — instantiates the SDK client once,
+    routes tool calls to SDK methods, serializes responses.
+Why dependency injection on the SDK client (`_make_client`): tests
+substitute a fake client with `respx`-mocked HTTP. Production code
+calls `_make_client()` with no args → defaults to env-driven SDK.
+"""
+from __future__ import annotations
+import asyncio
+import json
+import os
+import sys
+from typing import Any
+import httpx
+from mcp.server import Server
+from mcp.server.stdio import stdio_server
+from mcp.types import TextContent
+from structx_sdk import AsyncStructX, StructXError
+from ._tools import ALL_TOOLS
+from ._url_safety import UrlSafetyError, validate_fetch_url
+from ._version import __version__
+# Response size cap for `structx_extract_from_url`. 10MB matches typical
+# web-page upper bounds; larger pages are almost certainly site dumps
+# or attacks rather than real extraction targets.
+_URL_FETCH_BYTE_CAP = 10 * 1024 * 1024
+# Overall-fetch deadline (Phase 5.5 / φ). httpx's per-operation timeout
+# doesn't bound a slow-loris server emitting 1 byte/chunk infinitely
+# slowly. asyncio.wait_for around the whole fetch enforces a true
+# total-time deadline. 60s is generous for a single page (95th-percentile
+# load time for real pages is <10s) while bounding the worst case.
+_URL_FETCH_TOTAL_TIMEOUT_S = 60.0
+def _make_client() -> AsyncStructX:
+    """Construct the SDK client from env. Separated from import-time
+    code so tests can patch it. Raises StructXError if STRUCTX_API_KEY
+    is missing."""
+    return AsyncStructX(
+        api_key=os.environ.get("STRUCTX_API_KEY", ""),
+        base_url=os.environ.get("STRUCTX_BASE_URL"),
+    )
+def _check_api_key() -> None:
+    """Friendly preflight — print a one-line error instead of letting
+    the SDK raise a traceback at first tool call."""
+    if not os.environ.get("STRUCTX_API_KEY"):
+        sys.stderr.write(
+            "ERROR: STRUCTX_API_KEY environment variable is required.\n"
+            "Get a key at https://app.structx.ai/api-keys, then add it\n"
+            "to your MCP-client config or shell environment.\n"
+        )
+        sys.exit(1)
+def build_server(client: AsyncStructX | None = None) -> Server:
+    """Build the MCP server. Injectable `client` for tests; defaults to
+    env-driven SDK at runtime.
+    Tool dispatch is a flat if/elif chain — there are 5 tools and the
+    indirection of a registry would obscure the wire-up. If we grow
+    past ~10 tools, refactor.
+    """
+    server = Server(f"structx-mcp/{__version__}")
+    sdk = client  # bound at call-time, see _get_sdk
+    def _get_sdk() -> AsyncStructX:
+        nonlocal sdk
+        if sdk is None:
+            sdk = _make_client()
+        return sdk
+    @server.list_tools()
+    async def _list_tools():
+        return ALL_TOOLS
+    @server.call_tool()
+    async def _call_tool(name: str, arguments: dict[str, Any]) -> list[TextContent]:
+        try:
+            payload = await _dispatch(name, arguments, _get_sdk())
+            return [TextContent(type="text", text=json.dumps(payload, indent=2, default=str))]
+        except StructXError as e:
+            # Surface SDK errors as JSON the agent can reason about,
+            # not as exceptions that vanish into the MCP transport.
+            return [TextContent(
+                type="text",
+                text=json.dumps({"error": str(e), "type": type(e).__name__}, indent=2),
+            )]
+        except UrlSafetyError as e:
+            return [TextContent(
+                type="text",
+                text=json.dumps({"error": str(e), "type": "UrlSafetyError"}, indent=2),
+            )]
+    return server
+async def _dispatch(
+    name: str,
+    arguments: dict[str, Any],
+    sdk: AsyncStructX,
+) -> dict[str, Any]:
+    """Dispatch logic, kept separate from the MCP server wiring so tests
+    can call it directly without spinning up an MCP transport."""
+    if name == "structx_extract":
+        result = await sdk.extract(
+            content=arguments["content"],
+            schema=arguments.get("schema"),
+            template_slug=arguments.get("template_slug"),
+            tier=arguments.get("tier", "required"),
+            options=arguments.get("options"),
+        )
+        return result.model_dump()
+    if name == "structx_infer_schema":
+        result = await sdk.infer_schema(
+            content=arguments["content"],
+            content_type=arguments.get("content_type"),
+            hints=arguments.get("hints"),
+            return_recommendations=arguments.get("return_recommendations", True),
+        )
+        return result.model_dump()
+    if name == "structx_extract_from_url":
+        url = arguments["url"]
+        validate_fetch_url(url)
+        content = await _fetch_url(url)
+        result = await sdk.extract(
+            content=content,
+            schema=arguments.get("schema"),
+            template_slug=arguments.get("template_slug"),
+            tier=arguments.get("tier", "required"),
+            options=arguments.get("options"),
+        )
+        return result.model_dump()
+    if name == "structx_list_templates":
+        templates = await sdk.list_templates()
+        return {"templates": [t.model_dump(by_alias=True) for t in templates]}
+    if name == "structx_usage":
+        usage = await sdk.usage()
+        return usage.model_dump()
+    raise ValueError(f"Unknown tool: {name}")
+async def _fetch_url(url: str) -> str:
+    """Fetch a URL with size cap + overall timeout + no-redirect.
+    The httpx `timeout` is per-operation (per-chunk during streaming
+    reads); a malicious slow-loris server emitting 1 byte every 29s
+    would never hit the per-chunk timeout but could tie up the agent
+    for hours. asyncio.wait_for around the whole fetch enforces a
+    total deadline, bounding the worst case regardless of chunk
+    pacing. (Phase 5.5 / φ.)
+    Streams the response and breaks at the size cap rather than reading
+    a multi-GB body into memory. Decodes the body as UTF-8 with
+    errors='replace'; extraction works on noisy text, so there is no
+    need to be strict about encoding.
+    Validation already happened upstream in `validate_fetch_url`;
+    this is the actual I/O.
+    """
+    async def _do_fetch() -> str:
+        async with httpx.AsyncClient(
+            timeout=30.0,
+            follow_redirects=False,  # one less SSRF vector
+            limits=httpx.Limits(max_keepalive_connections=0),
+        ) as fetcher:
+            async with fetcher.stream("GET", url) as response:
+                response.raise_for_status()
+                chunks: list[bytes] = []
+                total = 0
+                async for chunk in response.aiter_bytes():
+                    total += len(chunk)
+                    if total > _URL_FETCH_BYTE_CAP:
+                        raise UrlSafetyError(
+                            f"Response exceeded {_URL_FETCH_BYTE_CAP // (1024*1024)}MB cap"
+                        )
+                    chunks.append(chunk)
+        return b"".join(chunks).decode("utf-8", errors="replace")
+    try:
+        return await asyncio.wait_for(_do_fetch(), timeout=_URL_FETCH_TOTAL_TIMEOUT_S)
+    except asyncio.TimeoutError as e:
+        raise UrlSafetyError(
+            f"Fetch exceeded {_URL_FETCH_TOTAL_TIMEOUT_S}s overall deadline "
+            f"(possible slow-loris server)"
+        ) from e
+async def serve() -> None:
+    """Run the MCP server over stdio."""
+    server = build_server()
+    async with stdio_server() as (read_stream, write_stream):
+        await server.run(read_stream, write_stream, server.create_initialization_options())
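
The two bounds that `_fetch_url` applies (a running byte cap and one overall deadline around the whole read) can be tried without a network. The sketch below is an illustration under assumptions, not the package's code: `fake_stream`, `read_capped`, `fetch_with_deadline`, and `FetchRejected` are hypothetical names, the simulated stream stands in for `response.aiter_bytes()`, and the limits are shrunk so the demo finishes in well under a second.

```python
import asyncio
from collections.abc import AsyncIterator

BYTE_CAP = 16          # illustrative; the package uses 10 MB
TOTAL_TIMEOUT_S = 0.2  # illustrative; the package uses 60 s


class FetchRejected(Exception):
    """Stand-in for UrlSafetyError in this sketch."""


async def fake_stream(chunks: list[bytes], delay_s: float) -> AsyncIterator[bytes]:
    """Simulated response body: yields each chunk after `delay_s` seconds."""
    for chunk in chunks:
        await asyncio.sleep(delay_s)
        yield chunk


async def read_capped(stream: AsyncIterator[bytes]) -> str:
    """Accumulate chunks, aborting as soon as the running total passes the cap."""
    parts: list[bytes] = []
    total = 0
    async for chunk in stream:
        total += len(chunk)
        if total > BYTE_CAP:
            raise FetchRejected(f"body exceeded {BYTE_CAP} bytes")
        parts.append(chunk)
    return b"".join(parts).decode("utf-8", errors="replace")


async def fetch_with_deadline(stream: AsyncIterator[bytes]) -> str:
    """Bound the whole read, not just each chunk, with one overall timeout."""
    try:
        return await asyncio.wait_for(read_capped(stream), timeout=TOTAL_TIMEOUT_S)
    except asyncio.TimeoutError as e:
        raise FetchRejected("overall deadline exceeded") from e


async def demo() -> list[str]:
    """Run one well-behaved, one oversized, and one slow stream."""
    cases = {
        "ok": fake_stream([b"hello", b" world"], 0.0),
        "too_big": fake_stream([b"x" * 10, b"y" * 10], 0.0),
        # Each chunk arrives in 0.05 s, well inside a 0.2 s per-chunk
        # budget, yet ten of them take about 0.5 s in total.
        "slow": fake_stream([b"a"] * 10, 0.05),
    }
    outcomes = []
    for label, stream in cases.items():
        try:
            outcomes.append(f"{label}: {await fetch_with_deadline(stream)!r}")
        except FetchRejected as e:
            outcomes.append(f"{label}: rejected ({e})")
    return outcomes


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

The `slow` case is the one the docstring of `_fetch_url` describes: no single chunk is late enough to trip a per-operation timeout, so only the outer `asyncio.wait_for` ends the read.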