PyPI - docpull - Versions diffs - 2.5.0__tar.gz → 3.0.0__tar.gz - Mend

docpull 2.5.0__tar.gz → 3.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (85)

{docpull-2.5.0/src/docpull.egg-info → docpull-3.0.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpull
-Version: 2.5.0
+Version: 3.0.0
 Summary: Pull documentation from the web and convert to clean markdown
 Author-email: Zachary Roth <support@raintree.technology>
 Maintainer-email: Raintree Technology <support@raintree.technology>
@@ -68,7 +68,6 @@ Provides-Extra: dev
 Requires-Dist: pytest>=7.0.0; extra == "dev"
 Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
 Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
-Requires-Dist: black>=23.0.0; extra == "dev"
 Requires-Dist: mypy>=1.0.0; extra == "dev"
 Requires-Dist: ruff>=0.1.0; extra == "dev"
 Requires-Dist: bandit>=1.7.0; extra == "dev"
@@ -222,7 +221,7 @@ pip install 'docpull[mcp]'
 docpull mcp  # starts the stdio server
 ```
-Add to Claude Desktop or Claude Code:
+Add to Claude Desktop or Claude Code manually:
 ```json
 {
@@ -235,13 +234,39 @@ Add to Claude Desktop or Claude Code:
 }
 ```
-Tools exposed:
+Or, if you use Claude Code, install the plugin instead — it bundles the MCP
+server, five slash commands (`/docs-add`, `/docs-search`, `/docs-list`,
+`/docs-refresh`, `/docs-remove`), and a meta-skill that teaches Claude
+when to reach for docpull automatically:
-- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl
-- `ensure_docs(source, force?)` — fetch a named library (cached 7 days)
+```bash
+# 1. Install docpull with the MCP extra (required for the plugin)
+pip install 'docpull[mcp]'
+```
+```
+# 2. Then in Claude Code:
+/plugin marketplace add raintree-technology/docpull
+/plugin install docpull@docpull
+```
+See [plugin/README.md](plugin/README.md) for details.
+Tools exposed (8 total — read tools advertise `readOnlyHint` so hosts that auto-approve safe tools won't prompt):
+Read:
+- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl. HTTPS-only, SSRF-validated.
 - `list_sources(category?)` — show available aliases (react, nextjs, fastapi, …)
-- `list_indexed()` — what has been fetched locally
-- `grep_docs(pattern, library?)` — regex search across fetched Markdown
+- `list_indexed()` — what has been fetched locally, with last-fetched age
+- `grep_docs(pattern, library?, limit?, context?)` — regex search across fetched Markdown (length-capped + wall-clock budgeted to mitigate ReDoS)
+- `read_doc(library, path, line_start?, line_end?)` — read a specific cached file, optionally line-sliced
+Write:
+- `ensure_docs(source, force?, profile?)` — fetch a named library (cached 7 days). Forwards progress to clients that supply a `progressToken`.
+- `add_source(name, url, description?, category?, max_pages?, force?)` — register a user alias (HTTPS-only, atomic write to `sources.yaml`).
+- `remove_source(name, delete_cache?)` — drop a user alias and (optionally) its cached docs.
+All tools that carry data also return `structuredContent` validated against an `outputSchema` for clients that prefer typed output.
 User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
@@ -254,6 +279,17 @@ sources:
     maxPages: 200
 ```
+### About the `mcp/` directory in this repo
+The `mcp/` directory at the repo root is a separate TypeScript + Bun MCP
+server backed by PostgreSQL with pgvector for semantic search. It is not
+the Python MCP server shipped in the `docpull` package described above
+— that one is the right choice for almost every user and is installed
+with `pip install 'docpull[mcp]'`. The `mcp/` tree is mirrored to its
+own repo at [`raintree-technology/docpull-mcp`](https://github.com/raintree-technology/docpull-mcp);
+unless you specifically need pgvector-backed semantic search, ignore it
+and use `docpull mcp`.
 ## Output
 Markdown files with YAML frontmatter:
@@ -350,6 +386,7 @@ docpull URL --preview-urls    # List URLs without fetching
 - [PyPI](https://pypi.org/project/docpull/)
 - [GitHub](https://github.com/raintree-technology/docpull)
 - [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)
+- [Metrics](https://github.com/raintree-technology/docpull/blob/main/METRICS.md) — auto-refreshed daily (PyPI downloads, plugin installs via clone count, traffic)
 ## License

{docpull-2.5.0 → docpull-3.0.0}/README.md RENAMED

@@ -140,7 +140,7 @@ pip install 'docpull[mcp]'
 docpull mcp  # starts the stdio server
 ```
-Add to Claude Desktop or Claude Code:
+Add to Claude Desktop or Claude Code manually:
 ```json
 {
@@ -153,13 +153,39 @@ Add to Claude Desktop or Claude Code:
 }
 ```
-Tools exposed:
+Or, if you use Claude Code, install the plugin instead — it bundles the MCP
+server, five slash commands (`/docs-add`, `/docs-search`, `/docs-list`,
+`/docs-refresh`, `/docs-remove`), and a meta-skill that teaches Claude
+when to reach for docpull automatically:
-- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl
-- `ensure_docs(source, force?)` — fetch a named library (cached 7 days)
+```bash
+# 1. Install docpull with the MCP extra (required for the plugin)
+pip install 'docpull[mcp]'
+```
+```
+# 2. Then in Claude Code:
+/plugin marketplace add raintree-technology/docpull
+/plugin install docpull@docpull
+```
+See [plugin/README.md](plugin/README.md) for details.
+Tools exposed (8 total — read tools advertise `readOnlyHint` so hosts that auto-approve safe tools won't prompt):
+Read:
+- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl. HTTPS-only, SSRF-validated.
 - `list_sources(category?)` — show available aliases (react, nextjs, fastapi, …)
-- `list_indexed()` — what has been fetched locally
-- `grep_docs(pattern, library?)` — regex search across fetched Markdown
+- `list_indexed()` — what has been fetched locally, with last-fetched age
+- `grep_docs(pattern, library?, limit?, context?)` — regex search across fetched Markdown (length-capped + wall-clock budgeted to mitigate ReDoS)
+- `read_doc(library, path, line_start?, line_end?)` — read a specific cached file, optionally line-sliced
+Write:
+- `ensure_docs(source, force?, profile?)` — fetch a named library (cached 7 days). Forwards progress to clients that supply a `progressToken`.
+- `add_source(name, url, description?, category?, max_pages?, force?)` — register a user alias (HTTPS-only, atomic write to `sources.yaml`).
+- `remove_source(name, delete_cache?)` — drop a user alias and (optionally) its cached docs.
+All tools that carry data also return `structuredContent` validated against an `outputSchema` for clients that prefer typed output.
 User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
@@ -172,6 +198,17 @@ sources:
     maxPages: 200
 ```
+### About the `mcp/` directory in this repo
+The `mcp/` directory at the repo root is a separate TypeScript + Bun MCP
+server backed by PostgreSQL with pgvector for semantic search. It is not
+the Python MCP server shipped in the `docpull` package described above
+— that one is the right choice for almost every user and is installed
+with `pip install 'docpull[mcp]'`. The `mcp/` tree is mirrored to its
+own repo at [`raintree-technology/docpull-mcp`](https://github.com/raintree-technology/docpull-mcp);
+unless you specifically need pgvector-backed semantic search, ignore it
+and use `docpull mcp`.
 ## Output
 Markdown files with YAML frontmatter:
@@ -268,6 +305,7 @@ docpull URL --preview-urls    # List URLs without fetching
 - [PyPI](https://pypi.org/project/docpull/)
 - [GitHub](https://github.com/raintree-technology/docpull)
 - [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)
+- [Metrics](https://github.com/raintree-technology/docpull/blob/main/METRICS.md) — auto-refreshed daily (PyPI downloads, plugin installs via clone count, traffic)
 ## License
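Note on the README change above: `grep_docs` is described as "length-capped + wall-clock budgeted to mitigate ReDoS". The diff does not show how docpull implements this, so the following is only a minimal sketch of one way such a guard could look. The function name `budgeted_grep` and both constants are hypothetical and are not taken from the package.

```python
import re
import time

MAX_PATTERN_LENGTH = 200    # hypothetical cap, not docpull's actual value
TIME_BUDGET_SECONDS = 2.0   # hypothetical budget, not docpull's actual value


def budgeted_grep(
    pattern: str, lines: list[str], limit: int = 50
) -> tuple[list[tuple[int, str]], bool]:
    """Return (matches, truncated); matches are (line_number, line) pairs."""
    # Length cap: reject oversized patterns before compiling them.
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError("pattern too long")
    regex = re.compile(pattern)
    deadline = time.monotonic() + TIME_BUDGET_SECONDS
    matches: list[tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        # The budget is checked between lines, so a single pathological
        # match on one line can still overrun. This mitigates ReDoS; it
        # does not eliminate it.
        if time.monotonic() > deadline or len(matches) >= limit:
            return matches, True
        if regex.search(line):
            matches.append((number, line))
    return matches, False
```

Python's `re` module cannot interrupt a match in progress, which is why the sketch checks the deadline only between lines and pairs it with a pattern-length cap.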

{docpull-2.5.0 → docpull-3.0.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "docpull"
-version = "2.5.0"
+version = "3.0.0"
 dynamic = []
 description = "Pull documentation from the web and convert to clean markdown"
 readme = {file = "README.md", content-type = "text/markdown"}
@@ -102,7 +102,6 @@ dev = [
     "pytest>=7.0.0",
     "pytest-cov>=4.0.0",
     "pytest-asyncio>=0.21.0",
-    "black>=23.0.0",
     "mypy>=1.0.0",
     "ruff>=0.1.0",
     "bandit>=1.7.0",
@@ -132,10 +131,6 @@ include = ["docpull*"]
 [tool.setuptools.package-data]
 docpull = ["py.typed"]
-[tool.black]
-line-length = 110
-target-version = ["py310", "py311", "py312", "py313", "py314"]
 [tool.ruff]
 line-length = 110
 target-version = "py310"

{docpull-2.5.0 → docpull-3.0.0}/src/docpull/__init__.py RENAMED

@@ -14,7 +14,7 @@ Usage:
             print(event)
 """
-__version__ = "2.5.0"
+__version__ = "3.0.0"
 from .cache import CacheManager, StreamingDeduplicator
 from .conversion.chunking import Chunk, TokenCounter, chunk_markdown

{docpull-2.5.0 → docpull-3.0.0}/src/docpull/cli.py RENAMED

@@ -562,8 +562,7 @@ def run_fetcher(args: argparse.Namespace) -> int:
                         n_chunks = len(ctx.chunks) if ctx.chunks else 0
                         extra = f" ({n_chunks} chunks)" if n_chunks else ""
                         console.print(
-                            f"[green]Saved:[/green] {ctx.output_path} "
-                            f"[{ctx.source_type or 'generic'}]{extra}"
+                            f"[green]Saved:[/green] {ctx.output_path} [{ctx.source_type or 'generic'}]{extra}"
                         )
                     return 0

{docpull-2.5.0 → docpull-3.0.0}/src/docpull/conversion/special_cases.py RENAMED

@@ -246,7 +246,8 @@ def _describe_type(schema: Any, spec: dict[str, Any]) -> str:
     if not isinstance(schema, dict):
         return "?"
     if "$ref" in schema:
-        return schema["$ref"].rsplit("/", 1)[-1]
+        ref: str = schema["$ref"]
+        return ref.rsplit("/", 1)[-1]
     for key in ("oneOf", "anyOf", "allOf"):
         if isinstance(schema.get(key), list) and schema[key]:
             seen: list[str] = []
@@ -349,9 +350,7 @@ class OpenApiExtractor:
             for method, op in ops.items():
                 if method.lower() not in _HTTP_METHODS or not isinstance(op, dict):
                     continue
-                self._render_operation(
-                    lines, path, method, op, shared_params, data
-                )
+                self._render_operation(lines, path, method, op, shared_params, data)
         return SpecialCaseResult(
             markdown="\n".join(lines).strip() + "\n",
@@ -410,9 +409,7 @@ class OpenApiExtractor:
                 lines.append(bullet)
             lines.append("")
-    def _render_request_body(
-        self, lines: list[str], body: Any, spec: dict[str, Any]
-    ) -> None:
+    def _render_request_body(self, lines: list[str], body: Any, spec: dict[str, Any]) -> None:
         if not isinstance(body, dict):
             return
         if "$ref" in body:
@@ -455,9 +452,7 @@ class OpenApiExtractor:
             lines.append(f"- body: {_describe_type(schema, spec)}")
         lines.append("")
-    def _render_responses(
-        self, lines: list[str], responses: Any, spec: dict[str, Any]
-    ) -> None:
+    def _render_responses(self, lines: list[str], responses: Any, spec: dict[str, Any]) -> None:
         if not isinstance(responses, dict) or not responses:
             return
         lines.append("**Responses:**")
@@ -535,11 +530,7 @@ class MdxSourceExtractor:
         for pattern in self._EDIT_PATTERNS:
             match = pattern.search(text)
             if match:
-                raw_url = (
-                    match.group(1)
-                    .replace("/blob/", "/raw/")
-                    .replace("/edit/", "/raw/")
-                )
+                raw_url = match.group(1).replace("/blob/", "/raw/").replace("/edit/", "/raw/")
                 # Return None so downstream runs, but attach hint via a cache
                 # mechanism. Simpler: return None always; step reads the URL
                 # if needed by re-running the regex.
@@ -567,9 +558,7 @@ def find_mdx_source_url(html: bytes) -> str | None:
     for pattern in MdxSourceExtractor._EDIT_PATTERNS:
         match = pattern.search(text)
         if match:
-            return (
-                match.group(1).replace("/blob/", "/raw/").replace("/edit/", "/raw/")
-            )
+            return match.group(1).replace("/blob/", "/raw/").replace("/edit/", "/raw/")
     return None

{docpull-2.5.0 → docpull-3.0.0}/src/docpull/core/fetcher.py RENAMED

@@ -265,9 +265,7 @@ class Fetcher:
         # built-in 50 MB ceiling.
         max_content_size_kw: dict[str, int] = {}
         if self.config.content_filter.max_file_size is not None:
-            max_content_size_kw["max_content_size"] = int(
-                self.config.content_filter.max_file_size
-            )
+            max_content_size_kw["max_content_size"] = int(self.config.content_filter.max_file_size)
         self._http_client = AsyncHttpClient(
             rate_limiter=self._rate_limiter,
             max_retries=self.config.network.max_retries,
@@ -509,11 +507,7 @@ class Fetcher:
         steps = self._pipeline.steps
         if not save:
-            steps = [
-                s
-                for s in steps
-                if s.name not in {"save", "save_json", "save_ndjson", "save_sqlite"}
-            ]
+            steps = [s for s in steps if s.name not in {"save", "save_json", "save_ndjson", "save_sqlite"}]
         pipeline = type(self._pipeline)(steps=steps)
         ctx = await pipeline.execute(url, output_path)
         if ctx.error:
@@ -531,8 +525,8 @@ class Fetcher:
         """
         Compute output path for a URL using the configured naming strategy.
-        - ``full`` / ``flat`` / ``short``: a single flattened filename
-          (URL path joined with underscores).
+        - ``full``: a single flattened filename (URL path joined with
+          underscores).
         - ``hierarchical``: URL path preserved as nested directories,
           terminating in ``<segment>.md`` or ``index.md`` for trailing
           slashes. The leaf is `_validate_output_path`-safe — every segment
@@ -545,7 +539,6 @@ class Fetcher:
             parts = _url_to_path_parts(url, self.config.url)
             return output_dir.joinpath(*parts)
-        # full / flat / short: aliased to full until 3.0
         filename = _url_to_filename(url, self.config.url)
         return output_dir / filename
@@ -638,9 +631,7 @@ class Fetcher:
         )
         discovered: list[str] = []
-        async for url in self._discoverer.discover(
-            start_url, max_urls=self.config.crawl.max_pages
-        ):
+        async for url in self._discoverer.discover(start_url, max_urls=self.config.crawl.max_pages):
             discovered.append(url)
             if self._cancelled:
                 yield FetchEvent(
@@ -756,9 +747,7 @@ class Fetcher:
                 )
             )
             try:
-                async for url in discoverer.discover(
-                    start_url, max_urls=self.config.crawl.max_pages
-                ):
+                async for url in discoverer.discover(start_url, max_urls=self.config.crawl.max_pages):
                     if self._cancelled:
                         break
                     await url_queue.put(url)
@@ -770,14 +759,10 @@ class Fetcher:
                         and self._cache_manager
                         and len(discovered_for_resume) % 200 == 0
                     ):
-                        self._cache_manager.save_discovered_urls(
-                            list(discovered_for_resume), start_url
-                        )
+                        self._cache_manager.save_discovered_urls(list(discovered_for_resume), start_url)
             finally:
                 if self.config.cache.enabled and self._cache_manager:
-                    self._cache_manager.save_discovered_urls(
-                        discovered_for_resume, start_url
-                    )
+                    self._cache_manager.save_discovered_urls(discovered_for_resume, start_url)
                 self._stats.urls_discovered = len(discovered_for_resume)
                 await event_queue.put(
                     FetchEvent(
@@ -810,6 +795,7 @@ class Fetcher:
                     continue
                 local_events: list[FetchEvent] = []
                 # Bind the per-iteration list as a default arg so ruff B023
                 # is happy. Closure is consumed synchronously by execute()
                 # before the next iteration anyway, so capture order is safe.
@@ -936,9 +922,7 @@ def fetch_one(url: str, **kwargs: object) -> PageContext:
     """
     try:
         asyncio.get_running_loop()
-        raise RuntimeError(
-            "fetch_one() called from async context. Use Fetcher.fetch_one() instead."
-        )
+        raise RuntimeError("fetch_one() called from async context. Use Fetcher.fetch_one() instead.")
     except RuntimeError as exc:
         if "no running event loop" not in str(exc).lower():
             raise
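Note on the naming-strategy docstring changed in this file: it now documents only `full` (one flattened filename, URL path joined with underscores) and `hierarchical` (URL path kept as nested directories, ending in `<segment>.md` or `index.md` for trailing slashes). The helpers below are a rough illustration of that difference. They are hypothetical stand-ins, not docpull's `_url_to_filename` or `_url_to_path_parts`, and they omit the segment sanitization the docstring mentions.

```python
from pathlib import Path
from urllib.parse import urlparse


def flat_name(url: str) -> str:
    """Hypothetical stand-in for ``full``: one flattened filename."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Join path segments with underscores; fall back to "index" for the root.
    return ("_".join(segments) or "index") + ".md"


def hierarchical_path(url: str) -> Path:
    """Hypothetical stand-in for ``hierarchical``: nested directories."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # A trailing slash (or the bare root) maps to index.md in that directory.
    if not segments or path.endswith("/"):
        return Path(*segments, "index.md")
    # Otherwise the last segment becomes the Markdown leaf file.
    return Path(*segments[:-1], segments[-1] + ".md")
```

For `https://example.com/docs/api/auth`, the first helper yields `docs_api_auth.md` and the second yields `docs/api/auth.md`.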

{docpull-2.5.0 → docpull-3.0.0}/src/docpull/discovery/filters.py RENAMED

@@ -29,19 +29,20 @@ def normalize_url(url: str) -> str:
     Returns:
         Normalized URL string
     """
-    # Use url_normalize library if available
+    # Use url_normalize library if available for case / percent-encoding
+    # cleanup. It does NOT strip fragments, so we always do that ourselves
+    # below — keeping behavior consistent whether the optional dep is
+    # installed or not.
     if URL_NORMALIZE_AVAILABLE:
         try:
-            result: str = url_normalize(url)
-            return result
+            normalized = url_normalize(url)
+            if normalized:
+                url = normalized
         except ValueError:
             logger.debug("url_normalize rejected URL during normalization", exc_info=True)
-    # Basic normalization
     parsed = urlparse(url)
-    # Remove fragment
-    normalized = urlunparse(
+    return urlunparse(
         (
             parsed.scheme.lower(),
             parsed.netloc.lower(),
@@ -52,8 +53,6 @@ def normalize_url(url: str) -> str:
         )
     )
-    return normalized
 class PatternFilter:
     """

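Note on the `normalize_url` change: the new comment says `url_normalize` does not strip fragments, so the function now always falls through to the `urlparse`/`urlunparse` step instead of returning early. The sketch below shows that control flow in isolation. The optional library is replaced by a `pre_normalize` callable so the example runs without it, and the middle tuple fields (path, params, query) are an assumption because the diff elides them.

```python
from __future__ import annotations

from collections.abc import Callable
from urllib.parse import urlparse, urlunparse


def normalize(url: str, pre_normalize: Callable[[str], str] | None = None) -> str:
    """Sketch of the 3.0.0 flow: optional cleanup first, fragment always dropped."""
    if pre_normalize is not None:
        try:
            cleaned = pre_normalize(url)
            if cleaned:  # keep the original URL if the helper returns nothing
                url = cleaned
        except ValueError:
            pass  # fall back to the unmodified URL, as the diff does
    parsed = urlparse(url)
    return urlunparse(
        (
            parsed.scheme.lower(),
            parsed.netloc.lower(),
            parsed.path,    # assumed: the diff elides these three fields
            parsed.params,
            parsed.query,
            "",             # fragment is always removed
        )
    )
```

The result is the same whether or not a pre-normalizer is supplied, which is the consistency the new comment is after.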
{docpull-2.5.0 → docpull-3.0.0}/src/docpull/http/client.py RENAMED

@@ -12,7 +12,7 @@ from types import TracebackType
 from urllib.parse import urljoin, urlparse
 import aiohttp
-from aiohttp.abc import AbstractResolver
+from aiohttp.abc import AbstractResolver, ResolveResult
 from ..security.url_validator import UrlValidator
 from .protocols import HttpResponse
@@ -45,14 +45,14 @@ class _ValidatedResolver(AbstractResolver):
         self,
         host: str,
         port: int = 0,
-        family: int = socket.AF_UNSPEC,
-    ) -> list[dict[str, object]]:
+        family: socket.AddressFamily = socket.AF_UNSPEC,
+    ) -> list[ResolveResult]:
         try:
             addresses = self._url_validator.resolve_allowed_addresses(host)
         except ValueError as err:
             raise OSError(str(err)) from err
-        results: list[dict[str, object]] = []
+        results: list[ResolveResult] = []
         for address in addresses:
             ip = ipaddress.ip_address(address)
             entry_family = socket.AF_INET6 if ip.version == 6 else socket.AF_INET
@@ -60,14 +60,14 @@ class _ValidatedResolver(AbstractResolver):
                 continue
             results.append(
-                {
-                    "hostname": host,
-                    "host": address,
-                    "port": port,
-                    "family": entry_family,
-                    "proto": socket.IPPROTO_TCP,
-                    "flags": socket.AI_NUMERICHOST,
-                }
+                ResolveResult(
+                    hostname=host,
+                    host=address,
+                    port=port,
+                    family=entry_family,
+                    proto=socket.IPPROTO_TCP,
+                    flags=socket.AI_NUMERICHOST,
+                )
             )
         if not results:
@@ -236,20 +236,21 @@ class AsyncHttpClient:
     async def __aenter__(self) -> AsyncHttpClient:
         """Enter async context and create session."""
-        connector_kwargs: dict[str, object] = {
-            "limit": 100,  # Total connection limit
-            "limit_per_host": 10,  # Per-host connection limit
-            "ttl_dns_cache": 300,  # DNS cache TTL
-        }
+        resolver: AbstractResolver | None = None
         if self._url_validator is not None and self._proxy is None:
-            connector_kwargs["resolver"] = _ValidatedResolver(self._url_validator)
+            resolver = _ValidatedResolver(self._url_validator)
         elif self._proxy is not None and self._url_validator is not None:
             logger.warning(
                 "Proxy mode: DNS-pinning resolver is not active. "
                 "URL validation still runs pre-flight, but the proxy resolves DNS independently."
             )
-        connector = aiohttp.TCPConnector(**connector_kwargs)
+        connector = aiohttp.TCPConnector(
+            limit=100,
+            limit_per_host=10,
+            ttl_dns_cache=300,
+            resolver=resolver,
+        )
         self._session = aiohttp.ClientSession(
             connector=connector,
             headers={"User-Agent": self._user_agent},
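Note on the `_ValidatedResolver` change: the resolver now builds typed `ResolveResult` entries instead of plain dicts. The sketch below reproduces the entry-building loop with a local `TypedDict` so it runs without aiohttp. `ResolveEntry` and `build_entries` are hypothetical names, and both the family filter and the final `OSError` are assumptions, since the diff elides those lines.

```python
import ipaddress
import socket
from typing import TypedDict


class ResolveEntry(TypedDict):
    """Local stand-in with the six keys the diff passes to ResolveResult."""

    hostname: str
    host: str
    port: int
    family: int
    proto: int
    flags: int


def build_entries(
    host: str, addresses: list[str], port: int = 0, family: int = socket.AF_UNSPEC
) -> list[ResolveEntry]:
    entries: list[ResolveEntry] = []
    for address in addresses:
        ip = ipaddress.ip_address(address)
        entry_family = socket.AF_INET6 if ip.version == 6 else socket.AF_INET
        # Assumed filter: skip addresses outside the requested family.
        if family not in (socket.AF_UNSPEC, entry_family):
            continue
        entries.append(
            ResolveEntry(
                hostname=host,
                host=address,
                port=port,
                family=entry_family,
                proto=socket.IPPROTO_TCP,
                flags=socket.AI_NUMERICHOST,
            )
        )
    if not entries:
        # Assumed behavior: fail the lookup when nothing is allowed.
        raise OSError(f"no allowed addresses for {host}")
    return entries
```

Keyword construction lets a type checker verify each field against the declared keys, which an untyped `dict[str, object]` literal did not allow.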

{docpull-2.5.0 → docpull-3.0.0}/src/docpull/mcp/server.py RENAMED

@@ -103,7 +103,11 @@ _GREP_DOCS_OUTPUT_SCHEMA = {
             "items": {
                 "type": "object",
                 "properties": {
-                    "path": {"type": "string"},
+                    "library": {"type": "string"},
+                    "path": {
+                        "type": "string",
+                        "description": "Relative to the library root; pass directly to read_doc",
+                    },
                     "match_count": {"type": "integer"},
                     "matches": {
                         "type": "array",
@@ -119,7 +123,7 @@ _GREP_DOCS_OUTPUT_SCHEMA = {
                         },
                     },
                 },
-                "required": ["path", "match_count", "matches"],
+                "required": ["library", "path", "match_count", "matches"],
             },
         },
         "truncated": {"type": "boolean"},
@@ -211,8 +215,7 @@ async def _run_stdio() -> int:
         from mcp.types import CallToolResult, TextContent, Tool, ToolAnnotations
     except ImportError:
         print(
-            "docpull mcp requires the 'mcp' package. Install with: "
-            "pip install docpull[mcp]",
+            "docpull mcp requires the 'mcp' package. Install with: pip install docpull[mcp]",
             file=sys.stderr,
         )
         return 1
@@ -333,8 +336,9 @@ async def _run_stdio() -> int:
                 description=(
                     "Regex search through fetched Markdown. Results are ranked by "
                     "match density (most matches per file first) and rendered with "
-                    "lines of surrounding context. Use ensure_docs first; then "
-                    "read_doc to pull more context around a hit."
+                    "lines of surrounding context. Each result returns the library "
+                    "and a path relative to the library root, so you can feed both "
+                    "fields straight into read_doc. Use ensure_docs first."
                 ),
                 annotations=ToolAnnotations(
                     title="Regex-search cached docs",
@@ -370,8 +374,9 @@ async def _run_stdio() -> int:
                 name="read_doc",
                 description=(
                     "Read a Markdown file from a fetched library, optionally sliced "
-                    "by line range. The natural follow-up to grep_docs: pass the "
-                    "library + path it returned to pull more surrounding context."
+                    "by line range. The natural follow-up to grep_docs: pass each "
+                    "result's library and path (path is already relative to the "
+                    "library root) to pull more surrounding context."
                 ),
                 annotations=ToolAnnotations(
                     title="Read a cached doc file",
@@ -584,7 +589,10 @@ async def _run_stdio() -> int:
         #     isError=False), and
         # (b) errors on tools with an outputSchema don't fail the validator
         #     for "missing structured content."
-        content = [TextContent(type="text", text=result.text)]
+        # `content` is typed `list[TextContent | ImageContent | ...]` on the SDK
+        # side; list invariance means we have to widen the local annotation
+        # explicitly even though TextContent is one of the valid variants.
+        content: list[Any] = [TextContent(type="text", text=result.text)]
         return CallToolResult(
             content=content,
             structuredContent=result.data if not result.is_error else None,
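Note on the `grep_docs` schema change: each result now requires `library` alongside `path`, and the tool descriptions say both fields can be passed straight to `read_doc`. A client-side helper might look roughly like the sketch below. `read_doc_args` is a hypothetical name, and the sample result in the usage note is invented for illustration.

```python
from typing import Any

# Required keys per result item, as listed in the 3.0.0 output schema.
REQUIRED_RESULT_KEYS = ("library", "path", "match_count", "matches")


def read_doc_args(result: dict[str, Any]) -> dict[str, str]:
    """Turn one grep_docs result item into arguments for read_doc."""
    missing = [key for key in REQUIRED_RESULT_KEYS if key not in result]
    if missing:
        # A 2.5.0-shaped result has no "library" key and lands here.
        raise ValueError(f"grep_docs result is missing: {', '.join(missing)}")
    # path is already relative to the library root, so no rewriting is needed.
    return {"library": result["library"], "path": result["path"]}
```

Given a result such as `{"library": "react", "path": "hooks/use-state.md", "match_count": 2, "matches": []}`, the helper returns just the `library` and `path` pair.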
