PyPI - docpull - Versions diffs - 3.0.2__tar.gz → 4.0.1__tar.gz - Mend

docpull 3.0.2.tar.gz → 4.0.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (98)

{docpull-3.0.2/src/docpull.egg-info → docpull-4.0.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpull
-Version: 3.0.2
+Version: 4.0.1
 Summary: Pull documentation from the web and convert to clean markdown
 Author-email: Zachary Roth <support@raintree.technology>
 Maintainer-email: Raintree Technology <support@raintree.technology>
@@ -42,7 +42,7 @@ Requires-Dist: beautifulsoup4>=4.12.0
 Requires-Dist: html2text>=2020.1.16
 Requires-Dist: defusedxml>=0.7.1
 Requires-Dist: extruct>=0.15.0
-Requires-Dist: aiohttp>=3.9.0
+Requires-Dist: aiohttp>=3.14.0
 Requires-Dist: idna>=3.15
 Requires-Dist: regex>=2024.11.6
 Requires-Dist: rich>=13.0.0
@@ -59,6 +59,7 @@ Provides-Extra: tokens
 Requires-Dist: tiktoken>=0.7.0; extra == "tokens"
 Provides-Extra: mcp
 Requires-Dist: mcp>=1.0.0; extra == "mcp"
+Requires-Dist: pyjwt>=2.13.0; extra == "mcp"
 Requires-Dist: python-multipart>=0.0.27; extra == "mcp"
 Requires-Dist: starlette>=1.0.1; extra == "mcp"
 Provides-Extra: llm
@@ -69,6 +70,7 @@ Requires-Dist: url-normalize>=1.4.0; extra == "all"
 Requires-Dist: trafilatura>=1.12.0; extra == "all"
 Requires-Dist: tiktoken>=0.7.0; extra == "all"
 Requires-Dist: mcp>=1.0.0; extra == "all"
+Requires-Dist: pyjwt>=2.13.0; extra == "all"
 Requires-Dist: python-multipart>=0.0.27; extra == "all"
 Requires-Dist: starlette>=1.0.1; extra == "all"
 Provides-Extra: dev
@@ -150,7 +152,7 @@ content directly from framework data feeds:
 | Mintlify  | `__NEXT_DATA__` with Mintlify tagging |
 | OpenAPI   | Renders `openapi.json` / `swagger.json` into Markdown |
 | Docusaurus| Detected and tagged; generic extractor produces Markdown |
-| Sphinx    | Detected and tagged; generic extractor produces Markdown |
+| Sphinx    | Detected from generator metadata / Read the Docs hosts and tagged; generic extractor produces Markdown |
 JS-only SPAs with no server-rendered content are detected and skipped with a
 clear reason (or, with `--strict-js-required`, reported as an error so agents
@@ -213,8 +215,8 @@ async def tool_call(url: str) -> str:
 ```bash
 docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
-docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
-docpull https://site.com --profile mirror   # Full archive, polite, cached.
+docpull https://site.com --profile llm      # NDJSON + chunks + metadata; JS-only pages skip unless --strict-js-required is passed.
+docpull https://site.com --profile mirror   # Full archive, polite, cached, hierarchical paths.
 docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.
 ```
@@ -283,7 +285,9 @@ Write:
 - `add_source(name, url, description?, category?, max_pages?, force?)` — register a user alias (HTTPS-only, atomic write to `sources.yaml`).
 - `remove_source(name, delete_cache?)` — drop a user alias and (optionally) its cached docs.
-All tools that carry data also return `structuredContent` validated against an `outputSchema` for clients that prefer typed output.
+All schema-backed tools return `structuredContent` validated against an
+`outputSchema` for clients that prefer typed output. `fetch_url` intentionally
+returns Markdown text directly.
 User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
@@ -296,16 +300,17 @@ sources:
     maxPages: 200
 ```
-### About the `mcp/` directory in this repo
+### Supported MCP path
-The `mcp/` directory at the repo root is a separate TypeScript + Bun MCP
-server backed by PostgreSQL with pgvector for semantic search. It is not
-the Python MCP server shipped in the `docpull` package described above
-— that one is the right choice for almost every user and is installed
-with `pip install 'docpull[mcp]'`. The `mcp/` tree is mirrored to its
-own repo at [`raintree-technology/docpull-mcp`](https://github.com/raintree-technology/docpull-mcp);
-unless you specifically need pgvector-backed semantic search, ignore it
-and use `docpull mcp`.
+The supported MCP server is the Python stdio server started by `docpull mcp`.
+That is the only MCP path covered by the `docpull` package release contract and
+the one agents, plugin users, Claude Code, Cursor, and Claude Desktop should
+use.
+This repository also contains an `mcp/` directory with an internal TypeScript +
+Bun lab for PostgreSQL/pgvector semantic search. It is not shipped by the Python
+package, is not documented as a user install path, and should be ignored unless
+you are explicitly developing that lab.
 ## Output
@@ -325,9 +330,14 @@ source_type: "nextjs"
 NDJSON (one record per page or chunk):
 ```json
-{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
+{"document_id": "doc_...", "chunk_id": "chunk_...", "url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
 ```
+Every output format also writes `corpus.manifest.json` next to the generated
+documents. The manifest records the run identity, output format, stable
+`document_id` / `chunk_id` values, content hashes, relative output paths, and
+chunk counts so regenerated corpora can be diffed and cited by agents.
 ## Security
 - HTTPS-only, mandatory robots.txt compliance
@@ -347,7 +357,7 @@ Run `docpull --help` for the full list. Highlights:
 ```
 Core:
-  --profile {rag,mirror,quick,llm,custom}
+  --profile {rag,mirror,quick,llm}
   --single                Fetch one URL (no crawl)
   --format {markdown,json,ndjson,sqlite}
   --stream                Stream NDJSON to stdout
@@ -366,27 +376,33 @@ Cache:
   --cache                 Enable incremental updates
   --cache-dir DIR
   --cache-ttl DAYS
+Crawl:
+  --max-concurrent N      Global request concurrency
+  --per-host-concurrent N Per-host request concurrency
 ```
 ## Performance
 End-to-end numbers from `tests/benchmarks/test_10k_pages.py` against a
 synthetic 10,000-page localhost site (RAG profile, `max_concurrent=50`,
-HTTP keep-alive, 5% injected duplicate content):
+`per_host_concurrent=50`, HTTP keep-alive, 5% injected duplicate content).
+The benchmark emits progress every 1,000 pages plus a final JSON report for
+trend tooling.
 | Metric | Value |
 |---|---|
-| Total wall time | ~27 s |
-| Discovery (sitemap parse) | ~80 ms |
-| Fetch + convert + save | ~27 s |
-| Per-page latency p50 / p95 / p99 | ~2.6 / 4.6 / 5.3 ms |
-| Peak RSS delta from baseline | ~28 MB |
-| Cache manifest size on disk | ~3.4 MB |
+| Total wall time | ~333 s |
+| Pages fetched / skipped / failed | 9,501 / 499 / 0 |
+| Time to first saved page | ~130 ms |
+| Per-page latency p50 / p95 / p99 | ~0 / 166 / 232 ms |
+| Peak RSS delta from baseline | ~94 MB |
+| Cache manifest size on disk | ~8.9 MB |
 | Duplicates detected (5% injected) | 499 / 500 |
 Reproduce with `make benchmark` (requires `aiohttp`; runs the gated
-benchmark in `tests/benchmarks/` and prints a JSON line you can pipe
-into trend tooling).
+benchmark in `tests/benchmarks/` and prints progress plus a JSON line you can
+pipe into trend tooling).
 ## Troubleshooting

{docpull-3.0.2 → docpull-4.0.1}/README.md RENAMED

@@ -62,7 +62,7 @@ content directly from framework data feeds:
 | Mintlify  | `__NEXT_DATA__` with Mintlify tagging |
 | OpenAPI   | Renders `openapi.json` / `swagger.json` into Markdown |
 | Docusaurus| Detected and tagged; generic extractor produces Markdown |
-| Sphinx    | Detected and tagged; generic extractor produces Markdown |
+| Sphinx    | Detected from generator metadata / Read the Docs hosts and tagged; generic extractor produces Markdown |
 JS-only SPAs with no server-rendered content are detected and skipped with a
 clear reason (or, with `--strict-js-required`, reported as an error so agents
@@ -125,8 +125,8 @@ async def tool_call(url: str) -> str:
 ```bash
 docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
-docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
-docpull https://site.com --profile mirror   # Full archive, polite, cached.
+docpull https://site.com --profile llm      # NDJSON + chunks + metadata; JS-only pages skip unless --strict-js-required is passed.
+docpull https://site.com --profile mirror   # Full archive, polite, cached, hierarchical paths.
 docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.
 ```
@@ -195,7 +195,9 @@ Write:
 - `add_source(name, url, description?, category?, max_pages?, force?)` — register a user alias (HTTPS-only, atomic write to `sources.yaml`).
 - `remove_source(name, delete_cache?)` — drop a user alias and (optionally) its cached docs.
-All tools that carry data also return `structuredContent` validated against an `outputSchema` for clients that prefer typed output.
+All schema-backed tools return `structuredContent` validated against an
+`outputSchema` for clients that prefer typed output. `fetch_url` intentionally
+returns Markdown text directly.
 User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
@@ -208,16 +210,17 @@ sources:
     maxPages: 200
 ```
-### About the `mcp/` directory in this repo
+### Supported MCP path
-The `mcp/` directory at the repo root is a separate TypeScript + Bun MCP
-server backed by PostgreSQL with pgvector for semantic search. It is not
-the Python MCP server shipped in the `docpull` package described above
-— that one is the right choice for almost every user and is installed
-with `pip install 'docpull[mcp]'`. The `mcp/` tree is mirrored to its
-own repo at [`raintree-technology/docpull-mcp`](https://github.com/raintree-technology/docpull-mcp);
-unless you specifically need pgvector-backed semantic search, ignore it
-and use `docpull mcp`.
+The supported MCP server is the Python stdio server started by `docpull mcp`.
+That is the only MCP path covered by the `docpull` package release contract and
+the one agents, plugin users, Claude Code, Cursor, and Claude Desktop should
+use.
+This repository also contains an `mcp/` directory with an internal TypeScript +
+Bun lab for PostgreSQL/pgvector semantic search. It is not shipped by the Python
+package, is not documented as a user install path, and should be ignored unless
+you are explicitly developing that lab.
 ## Output
@@ -237,9 +240,14 @@ source_type: "nextjs"
 NDJSON (one record per page or chunk):
 ```json
-{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
+{"document_id": "doc_...", "chunk_id": "chunk_...", "url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
 ```
+Every output format also writes `corpus.manifest.json` next to the generated
+documents. The manifest records the run identity, output format, stable
+`document_id` / `chunk_id` values, content hashes, relative output paths, and
+chunk counts so regenerated corpora can be diffed and cited by agents.
 ## Security
 - HTTPS-only, mandatory robots.txt compliance
@@ -259,7 +267,7 @@ Run `docpull --help` for the full list. Highlights:
 ```
 Core:
-  --profile {rag,mirror,quick,llm,custom}
+  --profile {rag,mirror,quick,llm}
   --single                Fetch one URL (no crawl)
   --format {markdown,json,ndjson,sqlite}
   --stream                Stream NDJSON to stdout
@@ -278,27 +286,33 @@ Cache:
   --cache                 Enable incremental updates
   --cache-dir DIR
   --cache-ttl DAYS
+Crawl:
+  --max-concurrent N      Global request concurrency
+  --per-host-concurrent N Per-host request concurrency
 ```
 ## Performance
 End-to-end numbers from `tests/benchmarks/test_10k_pages.py` against a
 synthetic 10,000-page localhost site (RAG profile, `max_concurrent=50`,
-HTTP keep-alive, 5% injected duplicate content):
+`per_host_concurrent=50`, HTTP keep-alive, 5% injected duplicate content).
+The benchmark emits progress every 1,000 pages plus a final JSON report for
+trend tooling.
 | Metric | Value |
 |---|---|
-| Total wall time | ~27 s |
-| Discovery (sitemap parse) | ~80 ms |
-| Fetch + convert + save | ~27 s |
-| Per-page latency p50 / p95 / p99 | ~2.6 / 4.6 / 5.3 ms |
-| Peak RSS delta from baseline | ~28 MB |
-| Cache manifest size on disk | ~3.4 MB |
+| Total wall time | ~333 s |
+| Pages fetched / skipped / failed | 9,501 / 499 / 0 |
+| Time to first saved page | ~130 ms |
+| Per-page latency p50 / p95 / p99 | ~0 / 166 / 232 ms |
+| Peak RSS delta from baseline | ~94 MB |
+| Cache manifest size on disk | ~8.9 MB |
 | Duplicates detected (5% injected) | 499 / 500 |
 Reproduce with `make benchmark` (requires `aiohttp`; runs the gated
-benchmark in `tests/benchmarks/` and prints a JSON line you can pipe
-into trend tooling).
+benchmark in `tests/benchmarks/` and prints progress plus a JSON line you can
+pipe into trend tooling).
 ## Troubleshooting
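
The README change above adds `document_id` and `chunk_id` to each NDJSON record. A minimal sketch of how a consumer might regroup chunk records by document, assuming records follow the shape shown in the README example. The sample values and the helper name `group_chunks` are hypothetical and are not part of the docpull API.

```python
import json
from collections import defaultdict

# Hypothetical sample records shaped like the README's NDJSON example.
SAMPLE_RECORDS = [
    {"document_id": "doc_a", "chunk_id": "chunk_a1", "url": "https://example.com/a",
     "title": "A", "content": "second part", "hash": "h2", "token_count": 2, "chunk_index": 1},
    {"document_id": "doc_a", "chunk_id": "chunk_a0", "url": "https://example.com/a",
     "title": "A", "content": "first part", "hash": "h1", "token_count": 2, "chunk_index": 0},
    {"document_id": "doc_b", "chunk_id": "chunk_b0", "url": "https://example.com/b",
     "title": "B", "content": "only part", "hash": "h3", "token_count": 2, "chunk_index": 0},
]
SAMPLE_NDJSON = "\n".join(json.dumps(record) for record in SAMPLE_RECORDS)


def group_chunks(ndjson_text: str) -> dict[str, list[dict]]:
    """Group NDJSON records by document_id, ordered by chunk_index."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        grouped[record["document_id"]].append(record)
    for chunks in grouped.values():
        chunks.sort(key=lambda record: record["chunk_index"])
    return dict(grouped)


grouped = group_chunks(SAMPLE_NDJSON)
for document_id, chunks in grouped.items():
    print(document_id, [chunk["chunk_id"] for chunk in chunks])
```

Keying on `document_id` rather than `url` is a design choice for this sketch: the README describes the IDs as stable across regenerated corpora, which is what makes diffing two runs practical.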

{docpull-3.0.2 → docpull-4.0.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "docpull"
-version = "3.0.2"
+version = "4.0.1"
 dynamic = []
 description = "Pull documentation from the web and convert to clean markdown"
 readme = {file = "README.md", content-type = "text/markdown"}
@@ -66,7 +66,7 @@ dependencies = [
     "html2text>=2020.1.16",
     "defusedxml>=0.7.1",
     "extruct>=0.15.0",
-    "aiohttp>=3.9.0",
+    "aiohttp>=3.14.0",  # 3.14.0 fixes CVE-2026-34993 and CVE-2026-47265
     "idna>=3.15",
     "regex>=2024.11.6",
     "rich>=13.0.0",
@@ -90,6 +90,7 @@ tokens = [
 ]
 mcp = [
     "mcp>=1.0.0",
+    "pyjwt>=2.13.0",
     "python-multipart>=0.0.27",
     "starlette>=1.0.1",
 ]
@@ -102,6 +103,7 @@ all = [
     "trafilatura>=1.12.0",
     "tiktoken>=0.7.0",
     "mcp>=1.0.0",
+    "pyjwt>=2.13.0",
     "python-multipart>=0.0.27",
     "starlette>=1.0.1",
 ]

{docpull-3.0.2 → docpull-4.0.1}/src/docpull/__init__.py RENAMED

@@ -14,7 +14,7 @@ Usage:
             print(event)
 """
-__version__ = "3.0.2"
+__version__ = "4.0.1"
 from .cache import CacheManager, StreamingDeduplicator
 from .conversion.chunking import Chunk, TokenCounter, chunk_markdown
@@ -34,12 +34,10 @@ from .pipeline.base import PageContext
 __all__ = [
     "__version__",
-    # Core
     "Fetcher",
     "fetch_blocking",
     "fetch_one",
     "PageContext",
-    # Config
     "DocpullConfig",
     "ProfileName",
     "CrawlConfig",
@@ -48,14 +46,11 @@ __all__ = [
     "NetworkConfig",
     "PerformanceConfig",
     "CacheConfig",
-    # Events
     "EventType",
     "FetchEvent",
     "FetchStats",
-    # Cache
     "CacheManager",
     "StreamingDeduplicator",
-    # Chunking
     "Chunk",
     "TokenCounter",
     "chunk_markdown",

{docpull-3.0.2 → docpull-4.0.1}/src/docpull/cache/__init__.py RENAMED

@@ -1,5 +1,6 @@
 """Caching and deduplication for docpull."""
+from .frontier import FrontierEntry, FrontierState, FrontierStore
 from .manager import DEFAULT_TTL_DAYS, CacheManager, CacheState, ManifestEntry
 from .streaming_dedup import StreamingDeduplicator
@@ -7,6 +8,9 @@ __all__ = [
     "CacheManager",
     "CacheState",
     "ManifestEntry",
+    "FrontierEntry",
+    "FrontierState",
+    "FrontierStore",
     "StreamingDeduplicator",
     "DEFAULT_TTL_DAYS",
 ]

docpull-4.0.1/src/docpull/cache/frontier.py ADDED

@@ -0,0 +1,199 @@
+"""Durable crawl frontier state for pause/resume and provenance."""
+from __future__ import annotations
+import json
+import logging
+from dataclasses import dataclass, field
+from enum import Enum
+from pathlib import Path
+from typing import Any
+from ..models.run import FRONTIER_SCHEMA_VERSION
+from ..time_utils import utc_now_iso
+logger = logging.getLogger(__name__)
+class FrontierState(str, Enum):
+    """Lifecycle state for a URL in the crawl frontier."""
+    QUEUED = "queued"
+    PROCESSING = "processing"
+    SUCCEEDED = "succeeded"
+    SKIPPED = "skipped"
+    FAILED = "failed"
+@dataclass
+class FrontierEntry:
+    url: str
+    state: FrontierState = FrontierState.QUEUED
+    depth: int | None = None
+    source: str | None = None
+    discovered_at: str = field(default_factory=utc_now_iso)
+    updated_at: str = field(default_factory=utc_now_iso)
+    attempts: int = 0
+    last_error: str | None = None
+    @classmethod
+    def from_json(cls, data: dict[str, Any]) -> FrontierEntry | None:
+        url = data.get("url")
+        if not isinstance(url, str):
+            return None
+        try:
+            state = FrontierState(str(data.get("state", FrontierState.QUEUED.value)))
+        except ValueError:
+            state = FrontierState.QUEUED
+        attempts = data.get("attempts")
+        discovered_at = data.get("discovered_at")
+        updated_at = data.get("updated_at")
+        return cls(
+            url=url,
+            state=state,
+            depth=data.get("depth") if isinstance(data.get("depth"), int) else None,
+            source=data.get("source") if isinstance(data.get("source"), str) else None,
+            discovered_at=discovered_at if isinstance(discovered_at, str) else utc_now_iso(),
+            updated_at=updated_at if isinstance(updated_at, str) else utc_now_iso(),
+            attempts=attempts if isinstance(attempts, int) else 0,
+            last_error=data.get("last_error") if isinstance(data.get("last_error"), str) else None,
+        )
+    def to_json(self) -> dict[str, Any]:
+        return {
+            "url": self.url,
+            "state": self.state.value,
+            "depth": self.depth,
+            "source": self.source,
+            "discovered_at": self.discovered_at,
+            "updated_at": self.updated_at,
+            "attempts": self.attempts,
+            "last_error": self.last_error,
+        }
+class FrontierStore:
+    """Small JSON-backed frontier store.
+    The store is intentionally simple because docpull is single-process today.
+    It gives us explicit URL lifecycle state and a compatibility fingerprint
+    without introducing a queue service or SQLite dependency for markdown users.
+    """
+    def __init__(self, path: Path):
+        self.path = Path(path)
+        self.entries: dict[str, FrontierEntry] = {}
+        self.start_url: str | None = None
+        self.run_fingerprint: dict[str, object] | None = None
+        self.created_at: str | None = None
+        self.updated_at: str | None = None
+        self._load()
+    def _load(self) -> None:
+        if not self.path.exists():
+            return
+        try:
+            data = json.loads(self.path.read_text(encoding="utf-8"))
+        except (OSError, json.JSONDecodeError) as err:
+            logger.warning("Could not load frontier store %s: %s", self.path, err)
+            return
+        if not isinstance(data, dict) or data.get("schema_version") != FRONTIER_SCHEMA_VERSION:
+            return
+        entries = data.get("entries")
+        if not isinstance(entries, list):
+            return
+        self.start_url = data.get("start_url") if isinstance(data.get("start_url"), str) else None
+        fingerprint = data.get("run_fingerprint")
+        self.run_fingerprint = fingerprint if isinstance(fingerprint, dict) else None
+        self.created_at = data.get("created_at") if isinstance(data.get("created_at"), str) else None
+        self.updated_at = data.get("updated_at") if isinstance(data.get("updated_at"), str) else None
+        for item in entries:
+            if not isinstance(item, dict):
+                continue
+            entry = FrontierEntry.from_json(item)
+            if entry:
+                self.entries[entry.url] = entry
+    def save(self) -> None:
+        self.path.parent.mkdir(parents=True, exist_ok=True)
+        now = utc_now_iso()
+        if self.created_at is None:
+            self.created_at = now
+        self.updated_at = now
+        data = {
+            "schema_version": FRONTIER_SCHEMA_VERSION,
+            "start_url": self.start_url,
+            "run_fingerprint": self.run_fingerprint,
+            "created_at": self.created_at,
+            "updated_at": self.updated_at,
+            "entries": [entry.to_json() for entry in self.entries.values()],
+        }
+        tmp = self.path.with_suffix(self.path.suffix + ".tmp")
+        try:
+            tmp.write_text(json.dumps(data, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
+            tmp.replace(self.path)
+        except Exception:
+            tmp.unlink(missing_ok=True)
+            raise
+    def initialize(self, *, start_url: str, run_fingerprint: dict[str, object]) -> None:
+        if self.start_url != start_url or self.run_fingerprint != run_fingerprint:
+            self.entries.clear()
+            self.created_at = utc_now_iso()
+        self.start_url = start_url
+        self.run_fingerprint = run_fingerprint
+        self.save()
+    def compatible(self, *, start_url: str, run_fingerprint: dict[str, object]) -> bool:
+        return self.start_url == start_url and self.run_fingerprint == run_fingerprint
+    def add(self, url: str, *, depth: int | None = None, source: str | None = None) -> None:
+        if url in self.entries:
+            return
+        self.entries[url] = FrontierEntry(url=url, depth=depth, source=source)
+    def add_many(self, urls: list[str], *, source: str | None = None) -> None:
+        for url in urls:
+            self.add(url, source=source)
+    def mark_processing(self, url: str) -> None:
+        entry = self.entries.get(url)
+        if not entry:
+            self.add(url)
+            entry = self.entries[url]
+        entry.state = FrontierState.PROCESSING
+        entry.attempts += 1
+        entry.updated_at = utc_now_iso()
+        self.save()
+    def mark_succeeded(self, url: str) -> None:
+        self._mark_terminal(url, FrontierState.SUCCEEDED)
+    def mark_skipped(self, url: str) -> None:
+        self._mark_terminal(url, FrontierState.SKIPPED)
+    def mark_failed(self, url: str, error: str | None = None) -> None:
+        self._mark_terminal(url, FrontierState.FAILED, error=error)
+    def _mark_terminal(self, url: str, state: FrontierState, error: str | None = None) -> None:
+        entry = self.entries.get(url)
+        if not entry:
+            self.add(url)
+            entry = self.entries[url]
+        entry.state = state
+        entry.last_error = error
+        entry.updated_at = utc_now_iso()
+        self.save()
+    def pending_urls(self) -> list[str]:
+        terminal = {FrontierState.SUCCEEDED, FrontierState.SKIPPED}
+        return [url for url, entry in self.entries.items() if entry.state not in terminal]
+    def clear(self) -> None:
+        if self.path.exists():
+            self.path.unlink()
+        self.entries.clear()
+        self.start_url = None
+        self.run_fingerprint = None
+        self.created_at = None
+        self.updated_at = None
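
Two behaviours in the added `frontier.py` are easy to miss when reading the diff: `save()` writes to a sibling `.tmp` file and then replaces the target, and `pending_urls()` treats only `succeeded` and `skipped` as terminal, so `failed` and `processing` entries are picked up again on resume. The following is a minimal standalone sketch of those two ideas. It does not import docpull; the function names `save_atomic` and `pending_urls`, the flat `url -> state` mapping, and the example URLs are simplifications assumed for illustration.

```python
import json
import tempfile
from pathlib import Path

# Mirrors the terminal set used by pending_urls() in the diff above.
TERMINAL_STATES = {"succeeded", "skipped"}


def save_atomic(path: Path, data: dict) -> None:
    """Write JSON to a sibling .tmp file, then replace the target in one step."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    try:
        tmp.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
        tmp.replace(path)
    except Exception:
        # Clean up the partial temp file and surface the original error.
        tmp.unlink(missing_ok=True)
        raise


def pending_urls(entries: dict[str, str]) -> list[str]:
    """Return URLs whose state is not terminal (failed ones are retried)."""
    return [url for url, state in entries.items() if state not in TERMINAL_STATES]


with tempfile.TemporaryDirectory() as tmpdir:
    store_path = Path(tmpdir) / "state" / "frontier.json"
    entries = {
        "https://example.com/a": "succeeded",
        "https://example.com/b": "failed",
        "https://example.com/c": "queued",
        "https://example.com/d": "processing",
        "https://example.com/e": "skipped",
    }
    save_atomic(store_path, {"entries": entries})
    reloaded = json.loads(store_path.read_text(encoding="utf-8"))["entries"]
    resumable = pending_urls(reloaded)
    leftover_tmp = store_path.with_suffix(".json.tmp").exists()

print(resumable)
```

Because the replace is the last step, a reader of the target path sees either the previous complete file or the new one, which suits the single-process design described in the `FrontierStore` docstring.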
