PyPI - docpull - Versions diffs - 2.2.0__tar.gz → 2.4.0__tar.gz - Mend

docpull 2.2.0__tar.gz → 2.4.0__tar.gz


Files changed (100)

docpull-2.4.0/PKG-INFO ADDED

@@ -0,0 +1,356 @@
+Metadata-Version: 2.4
+Name: docpull
+Version: 2.4.0
+Summary: Pull documentation from the web and convert to clean markdown
+Author-email: Zachary Roth <support@raintree.technology>
+Maintainer-email: Raintree Technology <support@raintree.technology>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/raintree-technology/docpull
+Project-URL: Documentation, https://github.com/raintree-technology/docpull#readme
+Project-URL: Repository, https://github.com/raintree-technology/docpull
+Project-URL: Source Code, https://github.com/raintree-technology/docpull
+Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
+Project-URL: Releases, https://github.com/raintree-technology/docpull/releases
+Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Education
+Classifier: Environment :: Console
+Classifier: Topic :: Documentation
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Software Development :: Documentation
+Classifier: Topic :: Text Processing :: Markup :: HTML
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Topic :: Utilities
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Natural Language :: English
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Typing :: Typed
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: beautifulsoup4>=4.12.0
+Requires-Dist: html2text>=2020.1.16
+Requires-Dist: defusedxml>=0.7.1
+Requires-Dist: extruct>=0.15.0
+Requires-Dist: aiohttp>=3.9.0
+Requires-Dist: rich>=13.0.0
+Requires-Dist: pyyaml>=6.0
+Requires-Dist: pydantic>=2.0
+Provides-Extra: proxy
+Requires-Dist: aiohttp-socks>=0.8.0; extra == "proxy"
+Provides-Extra: normalize
+Requires-Dist: url-normalize>=1.4.0; extra == "normalize"
+Provides-Extra: trafilatura
+Requires-Dist: trafilatura>=1.12.0; extra == "trafilatura"
+Provides-Extra: tokens
+Requires-Dist: tiktoken>=0.7.0; extra == "tokens"
+Provides-Extra: mcp
+Requires-Dist: mcp>=1.0.0; extra == "mcp"
+Provides-Extra: llm
+Requires-Dist: tiktoken>=0.7.0; extra == "llm"
+Provides-Extra: all
+Requires-Dist: aiohttp-socks>=0.8.0; extra == "all"
+Requires-Dist: url-normalize>=1.4.0; extra == "all"
+Requires-Dist: trafilatura>=1.12.0; extra == "all"
+Requires-Dist: tiktoken>=0.7.0; extra == "all"
+Requires-Dist: mcp>=1.0.0; extra == "all"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
+Requires-Dist: black>=23.0.0; extra == "dev"
+Requires-Dist: mypy>=1.0.0; extra == "dev"
+Requires-Dist: ruff>=0.1.0; extra == "dev"
+Requires-Dist: bandit>=1.7.0; extra == "dev"
+Requires-Dist: pip-audit>=2.0.0; extra == "dev"
+Requires-Dist: pre-commit>=3.0.0; extra == "dev"
+Requires-Dist: types-requests>=2.31.0; extra == "dev"
+Requires-Dist: types-beautifulsoup4>=4.12.0; extra == "dev"
+Requires-Dist: types-defusedxml>=0.7.0; extra == "dev"
+Requires-Dist: types-pyyaml>=6.0.0; extra == "dev"
+Dynamic: license-file
+# docpull
+**Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.**
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![Downloads](https://pepy.tech/badge/docpull)](https://pepy.tech/project/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+<p align="center">
+  <a href="https://docpull.raintree.technology">
+    <img src="https://pub-e85a1abca36f4fd8b4300a6ec2d6f45f.r2.dev/marketing/docpull/1768954147343-iaiziy-docpull-terminal-hero.gif" alt="docpull demo" width="600">
+  </a>
+</p>
+docpull uses async HTTP (not Playwright) to fetch server-rendered pages,
+extracts main content, and writes clean Markdown with source-URL frontmatter —
+in seconds, with a small install footprint. It won't render JavaScript, but for
+the large class of docs that don't need it (API references, Python/Go stdlib,
+most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a
+fast, auditable, sandbox-friendly way to pipe documentation into an LLM context,
+a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and
+CRLF-injection protections are on by default — a necessity when an AI agent
+is choosing the URLs.
+## Install
+```bash
+pip install docpull
+# Optional extras
+pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
+pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
+pip install 'docpull[mcp]'           # run as an MCP server for AI agents
+pip install 'docpull[all]'           # everything above
+```
+## Quick start
+```bash
+# Crawl and save Markdown
+docpull https://docs.example.com
+# One page, no crawl — the fast path for agents
+docpull https://docs.example.com/guide --single
+# LLM-ready NDJSON with 4k-token chunks streamed to stdout
+docpull https://docs.example.com --profile llm --stream | jq .
+# Mirror a site for offline use
+docpull https://docs.example.com --profile mirror --cache
+```
+## Framework-aware extraction
+docpull inspects each page before running the generic extractor and can pull
+content directly from framework data feeds:
+| Framework | Strategy |
+|-----------|----------|
+| Next.js   | Parses `__NEXT_DATA__` JSON |
+| Mintlify  | `__NEXT_DATA__` with Mintlify tagging |
+| OpenAPI   | Renders `openapi.json` / `swagger.json` into Markdown |
+| Docusaurus| Detected and tagged; generic extractor produces Markdown |
+| Sphinx    | Detected and tagged; generic extractor produces Markdown |
+JS-only SPAs with no server-rendered content are detected and skipped with a
+clear reason (or, with `--strict-js-required`, reported as an error so agents
+can route elsewhere).
+## Agent-friendly features
+- **`--single`** — fetch a single URL without discovery. Designed for tool loops.
+- **`--stream`** — NDJSON one-record-per-line, flushed on every page, pipeable.
+- **`--max-tokens-per-file N`** — split each page into token-bounded chunks on
+  heading boundaries (exact counts with tiktoken, estimate without).
+- **`--emit-chunks`** — write one file or record per chunk instead of per page.
+- **`--strict-js-required`** — hard-fail on JS-only pages instead of silently
+  skipping.
+- **`--extractor trafilatura`** — swap in [trafilatura](https://trafilatura.readthedocs.io/)
+  for sites where the default heuristics struggle.
+## Python API
+```python
+from docpull import fetch_one
+ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
+print(ctx.title, ctx.source_type)
+print(ctx.markdown[:500])
+```
+Async streaming:
+```python
+import asyncio
+from docpull import Fetcher, DocpullConfig, ProfileName, EventType
+async def main():
+    cfg = DocpullConfig(
+        url="https://docs.example.com",
+        profile=ProfileName.LLM,  # chunked NDJSON output
+    )
+    async with Fetcher(cfg) as fetcher:
+        async for event in fetcher.run():
+            if event.type == EventType.FETCH_PROGRESS:
+                print(f"{event.current}/{event.total}: {event.url}")
+        print(f"Done: {fetcher.stats.pages_fetched} pages")
+asyncio.run(main())
+```
+Single-page from an agent tool:
+```python
+from docpull import Fetcher, DocpullConfig
+async def tool_call(url: str) -> str:
+    async with Fetcher(DocpullConfig(url=url)) as f:
+        ctx = await f.fetch_one(url, save=False)
+        return ctx.markdown or ctx.error or ""
+```
+## Profiles
+```bash
+docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
+docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
+docpull https://site.com --profile mirror   # Full archive, polite, cached.
+docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.
+```
+## MCP server
+docpull ships an MCP (Model Context Protocol) server so AI agents can call it
+directly over stdio:
+```bash
+pip install 'docpull[mcp]'
+docpull mcp  # starts the stdio server
+```
+Add to Claude Desktop or Claude Code:
+```json
+{
+  "mcpServers": {
+    "docpull": {
+      "command": "docpull",
+      "args": ["mcp"]
+    }
+  }
+}
+```
+Tools exposed:
+- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl
+- `ensure_docs(source, force?)` — fetch a named library (cached 7 days)
+- `list_sources(category?)` — show available aliases (react, nextjs, fastapi, …)
+- `list_indexed()` — what has been fetched locally
+- `grep_docs(pattern, library?)` — regex search across fetched Markdown
+User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
+```yaml
+sources:
+  mydocs:
+    url: https://docs.example.com
+    description: My internal docs
+    category: internal
+    maxPages: 200
+```
+## Output
+Markdown files with YAML frontmatter:
+```markdown
+---
+title: "Getting Started"
+source: https://docs.example.com/guide
+source_type: "nextjs"
+---
+# Getting Started
+…
+```
+NDJSON (one record per page or chunk):
+```json
+{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
+```
+## Security
+- HTTPS-only, mandatory robots.txt compliance
+- SSRF protection: blocks private/internal network IPs, DNS rebinding via
+  connect-time address pinning
+- XXE protection via `defusedxml` on sitemaps
+- Path traversal and CRLF header injection guards
+- Auth headers stripped on cross-origin redirects
+When running with `--proxy`, DNS pinning is delegated to the proxy. Pass
+`--require-pinned-dns` to refuse this configuration and keep the connector-
+level SSRF guarantees in effect.
+## Options
+Run `docpull --help` for the full list. Highlights:
+```
+Core:
+  --profile {rag,mirror,quick,llm,custom}
+  --single                Fetch one URL (no crawl)
+  --format {markdown,json,ndjson,sqlite}
+  --stream                Stream NDJSON to stdout
+LLM / chunking:
+  --max-tokens-per-file N
+  --tokenizer NAME        tiktoken encoding (default cl100k_base)
+  --emit-chunks           One file/record per chunk
+Content extraction:
+  --extractor {default,trafilatura}
+  --no-special-cases      Disable framework extractors
+  --strict-js-required    Error on JS-only pages
+Cache:
+  --cache                 Enable incremental updates
+  --cache-dir DIR
+  --cache-ttl DAYS
+```
+## Performance
+End-to-end numbers from `tests/benchmarks/test_10k_pages.py` against a
+synthetic 10,000-page localhost site (RAG profile, `max_concurrent=50`,
+HTTP keep-alive, 5% injected duplicate content):
+| Metric | Value |
+|---|---|
+| Total wall time | ~27 s |
+| Discovery (sitemap parse) | ~80 ms |
+| Fetch + convert + save | ~27 s |
+| Per-page latency p50 / p95 / p99 | ~2.6 / 4.6 / 5.3 ms |
+| Peak RSS delta from baseline | ~28 MB |
+| Cache manifest size on disk | ~3.4 MB |
+| Duplicates detected (5% injected) | 499 / 500 |
+Reproduce with `make benchmark` (requires `aiohttp`; runs the gated
+benchmark in `tests/benchmarks/` and prints a JSON line you can pipe
+into trend tooling).
+## Troubleshooting
+```bash
+docpull --doctor              # Check installation
+docpull URL --verbose         # Verbose output
+docpull URL --dry-run         # Test without downloading
+docpull URL --preview-urls    # List URLs without fetching
+```
+## Links
+- [Website](https://docpull.raintree.technology)
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)
+## License
+MIT
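
The header block of the PKG-INFO file above uses RFC 822-style fields, so the `Requires-Dist` and `Provides-Extra` entries can be inspected with the standard library alone. The following is a minimal sketch, not part of the package: `split_requirements` is a hypothetical helper, the embedded text is a trimmed copy of the fields shown in the diff, and the marker handling assumes the simple `extra == "name"` form seen here.

```python
from email.parser import HeaderParser

# Trimmed copy of the metadata fields shown in the diff above.
PKG_INFO = """\
Metadata-Version: 2.4
Name: docpull
Version: 2.4.0
Requires-Python: >=3.10
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: pydantic>=2.0
Provides-Extra: proxy
Requires-Dist: aiohttp-socks>=0.8.0; extra == "proxy"
Provides-Extra: mcp
Requires-Dist: mcp>=1.0.0; extra == "mcp"
"""


def split_requirements(pkg_info_text):
    """Hypothetical helper: return (core, extras) from PKG-INFO text.

    `extras` maps each extra name to the requirements it pulls in.
    Only markers of the form `extra == "name"` are recognised.
    """
    msg = HeaderParser().parsestr(pkg_info_text)
    core = []
    extras = {name: [] for name in msg.get_all("Provides-Extra", [])}
    for line in msg.get_all("Requires-Dist", []):
        requirement, _, marker = line.partition(";")
        marker = marker.strip()
        if marker.startswith("extra =="):
            name = marker.split("==", 1)[1].strip().strip("\"'")
            extras.setdefault(name, []).append(requirement.strip())
        else:
            core.append(requirement.strip())
    return core, extras


core, extras = split_requirements(PKG_INFO)
print("core:", core)
for name, requirements in extras.items():
    print(f"extra [{name}]:", requirements)
```

Run against the full file, the same approach would list every extra declared in the diff (`proxy`, `normalize`, `trafilatura`, `tokens`, `mcp`, `llm`, `all`, `dev`).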

docpull-2.4.0/README.md ADDED

@@ -0,0 +1,274 @@
+# docpull
+**Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.**
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![Downloads](https://pepy.tech/badge/docpull)](https://pepy.tech/project/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+<p align="center">
+  <a href="https://docpull.raintree.technology">
+    <img src="https://pub-e85a1abca36f4fd8b4300a6ec2d6f45f.r2.dev/marketing/docpull/1768954147343-iaiziy-docpull-terminal-hero.gif" alt="docpull demo" width="600">
+  </a>
+</p>
+docpull uses async HTTP (not Playwright) to fetch server-rendered pages,
+extracts main content, and writes clean Markdown with source-URL frontmatter —
+in seconds, with a small install footprint. It won't render JavaScript, but for
+the large class of docs that don't need it (API references, Python/Go stdlib,
+most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a
+fast, auditable, sandbox-friendly way to pipe documentation into an LLM context,
+a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and
+CRLF-injection protections are on by default — a necessity when an AI agent
+is choosing the URLs.
+## Install
+```bash
+pip install docpull
+# Optional extras
+pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
+pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
+pip install 'docpull[mcp]'           # run as an MCP server for AI agents
+pip install 'docpull[all]'           # everything above
+```
+## Quick start
+```bash
+# Crawl and save Markdown
+docpull https://docs.example.com
+# One page, no crawl — the fast path for agents
+docpull https://docs.example.com/guide --single
+# LLM-ready NDJSON with 4k-token chunks streamed to stdout
+docpull https://docs.example.com --profile llm --stream | jq .
+# Mirror a site for offline use
+docpull https://docs.example.com --profile mirror --cache
+```
+## Framework-aware extraction
+docpull inspects each page before running the generic extractor and can pull
+content directly from framework data feeds:
+| Framework | Strategy |
+|-----------|----------|
+| Next.js   | Parses `__NEXT_DATA__` JSON |
+| Mintlify  | `__NEXT_DATA__` with Mintlify tagging |
+| OpenAPI   | Renders `openapi.json` / `swagger.json` into Markdown |
+| Docusaurus| Detected and tagged; generic extractor produces Markdown |
+| Sphinx    | Detected and tagged; generic extractor produces Markdown |
+JS-only SPAs with no server-rendered content are detected and skipped with a
+clear reason (or, with `--strict-js-required`, reported as an error so agents
+can route elsewhere).
+## Agent-friendly features
+- **`--single`** — fetch a single URL without discovery. Designed for tool loops.
+- **`--stream`** — NDJSON one-record-per-line, flushed on every page, pipeable.
+- **`--max-tokens-per-file N`** — split each page into token-bounded chunks on
+  heading boundaries (exact counts with tiktoken, estimate without).
+- **`--emit-chunks`** — write one file or record per chunk instead of per page.
+- **`--strict-js-required`** — hard-fail on JS-only pages instead of silently
+  skipping.
+- **`--extractor trafilatura`** — swap in [trafilatura](https://trafilatura.readthedocs.io/)
+  for sites where the default heuristics struggle.
+## Python API
+```python
+from docpull import fetch_one
+ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
+print(ctx.title, ctx.source_type)
+print(ctx.markdown[:500])
+```
+Async streaming:
+```python
+import asyncio
+from docpull import Fetcher, DocpullConfig, ProfileName, EventType
+async def main():
+    cfg = DocpullConfig(
+        url="https://docs.example.com",
+        profile=ProfileName.LLM,  # chunked NDJSON output
+    )
+    async with Fetcher(cfg) as fetcher:
+        async for event in fetcher.run():
+            if event.type == EventType.FETCH_PROGRESS:
+                print(f"{event.current}/{event.total}: {event.url}")
+        print(f"Done: {fetcher.stats.pages_fetched} pages")
+asyncio.run(main())
+```
+Single-page from an agent tool:
+```python
+from docpull import Fetcher, DocpullConfig
+async def tool_call(url: str) -> str:
+    async with Fetcher(DocpullConfig(url=url)) as f:
+        ctx = await f.fetch_one(url, save=False)
+        return ctx.markdown or ctx.error or ""
+```
+## Profiles
+```bash
+docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
+docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
+docpull https://site.com --profile mirror   # Full archive, polite, cached.
+docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.
+```
+## MCP server
+docpull ships an MCP (Model Context Protocol) server so AI agents can call it
+directly over stdio:
+```bash
+pip install 'docpull[mcp]'
+docpull mcp  # starts the stdio server
+```
+Add to Claude Desktop or Claude Code:
+```json
+{
+  "mcpServers": {
+    "docpull": {
+      "command": "docpull",
+      "args": ["mcp"]
+    }
+  }
+}
+```
+Tools exposed:
+- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl
+- `ensure_docs(source, force?)` — fetch a named library (cached 7 days)
+- `list_sources(category?)` — show available aliases (react, nextjs, fastapi, …)
+- `list_indexed()` — what has been fetched locally
+- `grep_docs(pattern, library?)` — regex search across fetched Markdown
+User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
+```yaml
+sources:
+  mydocs:
+    url: https://docs.example.com
+    description: My internal docs
+    category: internal
+    maxPages: 200
+```
+## Output
+Markdown files with YAML frontmatter:
+```markdown
+---
+title: "Getting Started"
+source: https://docs.example.com/guide
+source_type: "nextjs"
+---
+# Getting Started
+…
+```
+NDJSON (one record per page or chunk):
+```json
+{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
+```
+## Security
+- HTTPS-only, mandatory robots.txt compliance
+- SSRF protection: blocks private/internal network IPs, DNS rebinding via
+  connect-time address pinning
+- XXE protection via `defusedxml` on sitemaps
+- Path traversal and CRLF header injection guards
+- Auth headers stripped on cross-origin redirects
+When running with `--proxy`, DNS pinning is delegated to the proxy. Pass
+`--require-pinned-dns` to refuse this configuration and keep the connector-
+level SSRF guarantees in effect.
+## Options
+Run `docpull --help` for the full list. Highlights:
+```
+Core:
+  --profile {rag,mirror,quick,llm,custom}
+  --single                Fetch one URL (no crawl)
+  --format {markdown,json,ndjson,sqlite}
+  --stream                Stream NDJSON to stdout
+LLM / chunking:
+  --max-tokens-per-file N
+  --tokenizer NAME        tiktoken encoding (default cl100k_base)
+  --emit-chunks           One file/record per chunk
+Content extraction:
+  --extractor {default,trafilatura}
+  --no-special-cases      Disable framework extractors
+  --strict-js-required    Error on JS-only pages
+Cache:
+  --cache                 Enable incremental updates
+  --cache-dir DIR
+  --cache-ttl DAYS
+```
+## Performance
+End-to-end numbers from `tests/benchmarks/test_10k_pages.py` against a
+synthetic 10,000-page localhost site (RAG profile, `max_concurrent=50`,
+HTTP keep-alive, 5% injected duplicate content):
+| Metric | Value |
+|---|---|
+| Total wall time | ~27 s |
+| Discovery (sitemap parse) | ~80 ms |
+| Fetch + convert + save | ~27 s |
+| Per-page latency p50 / p95 / p99 | ~2.6 / 4.6 / 5.3 ms |
+| Peak RSS delta from baseline | ~28 MB |
+| Cache manifest size on disk | ~3.4 MB |
+| Duplicates detected (5% injected) | 499 / 500 |
+Reproduce with `make benchmark` (requires `aiohttp`; runs the gated
+benchmark in `tests/benchmarks/` and prints a JSON line you can pipe
+into trend tooling).
+## Troubleshooting
+```bash
+docpull --doctor              # Check installation
+docpull URL --verbose         # Verbose output
+docpull URL --dry-run         # Test without downloading
+docpull URL --preview-urls    # List URLs without fetching
+```
+## Links
+- [Website](https://docpull.raintree.technology)
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)
+## License
+MIT
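
The README above documents an NDJSON output with one record per page or chunk. A downstream consumer might look roughly like the sketch below. It relies only on the field names in the README's sample record (`url`, `title`, `content`, `hash`, `token_count`, `chunk_index`); the helper names `group_chunks` and `total_tokens`, the grouping strategy, and the sample records are assumptions made for illustration, not part of docpull.

```python
import io
import json
from collections import defaultdict


def group_chunks(stream):
    """Group NDJSON records by URL, ordering each page's chunks by chunk_index."""
    pages = defaultdict(list)
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        pages[record["url"]].append(record)
    for records in pages.values():
        records.sort(key=lambda r: r.get("chunk_index", 0))
    return dict(pages)


def total_tokens(records):
    """Sum token_count over records, treating a missing field as 0."""
    return sum(r.get("token_count", 0) for r in records)


# Hypothetical sample input shaped like the README's example record.
sample = io.StringIO(
    '{"url": "https://docs.example.com/guide", "title": "Guide", '
    '"content": "part two", "hash": "b", "token_count": 300, "chunk_index": 1}\n'
    '{"url": "https://docs.example.com/guide", "title": "Guide", '
    '"content": "part one", "hash": "a", "token_count": 842, "chunk_index": 0}\n'
    '{"url": "https://docs.example.com/api", "title": "API", '
    '"content": "reference", "hash": "c", "token_count": 120, "chunk_index": 0}\n'
)

pages = group_chunks(sample)
for url, records in pages.items():
    print(url, len(records), "chunk(s),", total_tokens(records), "tokens")
```

To consume the CLI's streamed output instead of the in-memory sample, `sys.stdin` could be passed to `group_chunks` in place of `sample`.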
