PyPI - docpull - Versions diffs - 1.0.1__tar.gz → 1.0.2__tar.gz - Mend

docpull 1.0.1tar.gz → 1.0.2tar.gz


Files changed (49)

docpull-1.0.2/PKG-INFO ADDED

@@ -0,0 +1,215 @@
+Metadata-Version: 2.4
+Name: docpull
+Version: 1.0.2
+Summary: Pull documentation from the web and convert to clean markdown
+Author-email: Zachary Roth <support@raintree.technology>
+Maintainer-email: Raintree Technology <support@raintree.technology>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/raintree-technology/docpull
+Project-URL: Documentation, https://github.com/raintree-technology/docpull#readme
+Project-URL: Repository, https://github.com/raintree-technology/docpull
+Project-URL: Source Code, https://github.com/raintree-technology/docpull
+Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
+Project-URL: Changelog, https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md
+Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Education
+Classifier: Environment :: Console
+Classifier: Topic :: Documentation
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Software Development :: Documentation
+Classifier: Topic :: Text Processing :: Markup :: HTML
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Topic :: Utilities
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Natural Language :: English
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Typing :: Typed
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: requests>=2.31.0
+Requires-Dist: beautifulsoup4>=4.12.0
+Requires-Dist: html2text>=2020.1.16
+Requires-Dist: defusedxml>=0.7.1
+Requires-Dist: aiohttp>=3.9.0
+Requires-Dist: rich>=13.0.0
+Provides-Extra: yaml
+Requires-Dist: pyyaml>=6.0; extra == "yaml"
+Provides-Extra: js
+Requires-Dist: playwright>=1.40.0; extra == "js"
+Provides-Extra: all
+Requires-Dist: pyyaml>=6.0; extra == "all"
+Requires-Dist: playwright>=1.40.0; extra == "all"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
+Requires-Dist: black>=23.0.0; extra == "dev"
+Requires-Dist: mypy>=1.0.0; extra == "dev"
+Requires-Dist: ruff>=0.1.0; extra == "dev"
+Requires-Dist: bandit>=1.7.0; extra == "dev"
+Requires-Dist: pip-audit>=2.0.0; extra == "dev"
+Requires-Dist: types-requests>=2.31.0; extra == "dev"
+Requires-Dist: types-beautifulsoup4>=4.12.0; extra == "dev"
+Requires-Dist: types-aiohttp>=3.9.0; extra == "dev"
+Dynamic: license-file
+# docpull
+**Pulls documentation from any website and converts it into clean, AI-ready Markdown.**
+Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
+## Why docpull?
+Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.
+## Key Features
+- Works on any documentation site
+- Smart extraction of main content
+- Async + parallel fetching (up to 10× faster)
+- Optional JavaScript rendering via Playwright
+- Sitemap + link crawling
+- URL-based filtering (include/exclude)
+- Rate limiting, timeouts, content-type checks
+- Saves docs in structured Markdown with YAML metadata
+- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
+## Quick Start
+```bash
+pip install docpull
+docpull https://aptos.dev
+docpull stripe           # use a built-in profile
+docpull https://site.com/docs --max-pages 100 --max-concurrent 20
+```
+### JavaScript-heavy sites
+```bash
+pip install docpull[js]
+python -m playwright install chromium
+docpull https://site.com --js
+```
+## Python API
+```python
+from docpull import GenericAsyncFetcher
+fetcher = GenericAsyncFetcher(
+    url_or_profile="https://aptos.dev",
+    output_dir="./docs",
+    max_pages=100,
+    max_concurrent=20,
+)
+fetcher.fetch()
+```
+## Common Options
+- `--max-pages N` – limit crawl size
+- `--max-depth N` – restrict link depth
+- `--max-concurrent N` – control parallel fetches
+- `--js` – enable Playwright rendering
+- `--output-dir DIR`
+- `--rate-limit X`
+- `--no-skip-existing`
+- `--dry-run`
+## Performance
+Async fetching drastically reduces runtime:
+| Pages | Sync | Async | Speedup |
+|-------|------|-------|---------|
+| 50 | ~50s | ~6s | 8× faster |
+Higher concurrency yields even better results.
+## Output Format
+Each downloaded page becomes a Markdown file:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+---
+# Payment Intents
+...
+```
+Directory layout mirrors the target site's structure.
+## Configuration File (Optional)
+```yaml
+output_dir: ./docs
+rate_limit: 0.5
+sources:
+  - stripe
+  - nextjs
+```
+Run with:
+```bash
+docpull --config config.yaml
+```
+## Custom Profiles
+Easily define profiles for frequently scraped sites.
+```python
+from docpull.profiles.base import SiteProfile
+MY_PROFILE = SiteProfile(
+    name="mysite",
+    domains={"docs.mysite.com"},
+    include_patterns=["/docs/", "/api/"],
+)
+```
+## Security
+- HTTPS-only
+- Blocks private network IPs
+- 50MB page size limit
+- Timeout controls
+- Validates content-type
+- Playwright sandboxing
+## Troubleshooting
+- **Site requires JS**: install Playwright + `--js`
+- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
+- **Large sites**: set `--max-pages`
+## Links
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Issues](https://github.com/raintree-technology/docpull/issues)
+## License
+MIT License - see [LICENSE](LICENSE) file for details
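
The `Requires-Dist` and `Provides-Extra` fields in the PKG-INFO hunk above follow the RFC 822-style core metadata layout, so they can be read with the standard library's `email.parser`. The following is a minimal sketch, assuming the diff's leading `+` markers have already been stripped; `split_requirements` is a hypothetical helper name, not part of docpull, and the sample text is an abbreviated copy of the fields shown above.

```python
from email.parser import Parser

# Abbreviated copy of header fields from the PKG-INFO hunk ("+" markers removed).
PKG_INFO = """\
Metadata-Version: 2.4
Name: docpull
Version: 1.0.2
Requires-Python: >=3.9
Requires-Dist: requests>=2.31.0
Requires-Dist: aiohttp>=3.9.0
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0; extra == "yaml"
Provides-Extra: js
Requires-Dist: playwright>=1.40.0; extra == "js"
"""


def split_requirements(pkg_info_text):
    """Hypothetical helper: return (core requirements, {extra: requirements})."""
    msg = Parser().parsestr(pkg_info_text)
    core = []
    extras = {name: [] for name in msg.get_all("Provides-Extra", [])}
    for line in msg.get_all("Requires-Dist", []):
        # Everything after ";" is an environment marker such as: extra == "js"
        requirement, _, marker = line.partition(";")
        marker = marker.strip()
        if marker.startswith("extra =="):
            extra = marker.split("==", 1)[1].strip().strip("\"'")
            extras.setdefault(extra, []).append(requirement.strip())
        else:
            core.append(requirement.strip())
    return core, extras


core, extras = split_requirements(PKG_INFO)
print(core)
print(extras)
```

This sketch only recognises markers of the simple form `extra == "name"`, which is all the hunk above contains; compound markers would need a real marker parser.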

docpull-1.0.2/README.md ADDED

@@ -0,0 +1,148 @@
+# docpull
+**Pulls documentation from any website and converts it into clean, AI-ready Markdown.**
+Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
+## Why docpull?
+Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.
+## Key Features
+- Works on any documentation site
+- Smart extraction of main content
+- Async + parallel fetching (up to 10× faster)
+- Optional JavaScript rendering via Playwright
+- Sitemap + link crawling
+- URL-based filtering (include/exclude)
+- Rate limiting, timeouts, content-type checks
+- Saves docs in structured Markdown with YAML metadata
+- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
+## Quick Start
+```bash
+pip install docpull
+docpull https://aptos.dev
+docpull stripe           # use a built-in profile
+docpull https://site.com/docs --max-pages 100 --max-concurrent 20
+```
+### JavaScript-heavy sites
+```bash
+pip install docpull[js]
+python -m playwright install chromium
+docpull https://site.com --js
+```
+## Python API
+```python
+from docpull import GenericAsyncFetcher
+fetcher = GenericAsyncFetcher(
+    url_or_profile="https://aptos.dev",
+    output_dir="./docs",
+    max_pages=100,
+    max_concurrent=20,
+)
+fetcher.fetch()
+```
+## Common Options
+- `--max-pages N` – limit crawl size
+- `--max-depth N` – restrict link depth
+- `--max-concurrent N` – control parallel fetches
+- `--js` – enable Playwright rendering
+- `--output-dir DIR`
+- `--rate-limit X`
+- `--no-skip-existing`
+- `--dry-run`
+## Performance
+Async fetching drastically reduces runtime:
+| Pages | Sync | Async | Speedup |
+|-------|------|-------|---------|
+| 50 | ~50s | ~6s | 8× faster |
+Higher concurrency yields even better results.
+## Output Format
+Each downloaded page becomes a Markdown file:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+---
+# Payment Intents
+...
+```
+Directory layout mirrors the target site's structure.
+## Configuration File (Optional)
+```yaml
+output_dir: ./docs
+rate_limit: 0.5
+sources:
+  - stripe
+  - nextjs
+```
+Run with:
+```bash
+docpull --config config.yaml
+```
+## Custom Profiles
+Easily define profiles for frequently scraped sites.
+```python
+from docpull.profiles.base import SiteProfile
+MY_PROFILE = SiteProfile(
+    name="mysite",
+    domains={"docs.mysite.com"},
+    include_patterns=["/docs/", "/api/"],
+)
+```
+## Security
+- HTTPS-only
+- Blocks private network IPs
+- 50MB page size limit
+- Timeout controls
+- Validates content-type
+- Playwright sandboxing
+## Troubleshooting
+- **Site requires JS**: install Playwright + `--js`
+- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
+- **Large sites**: set `--max-pages`
+## Links
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Issues](https://github.com/raintree-technology/docpull/issues)
+## License
+MIT License - see [LICENSE](LICENSE) file for details
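
The "Output Format" section of the README above shows each page saved as Markdown with a front matter block delimited by `---`. As a rough illustration of how a downstream consumer might separate that metadata from the body, here is a minimal sketch. It assumes the front matter holds only flat `key: value` pairs, as in the README sample; `split_front_matter` is a hypothetical name, and anything beyond flat scalars would need a real YAML parser.

```python
def split_front_matter(text):
    """Hypothetical helper: split a Markdown page into (metadata dict, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter at all
    meta = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: the rest of the file is the body.
            return meta, "\n".join(lines[index + 1:])
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # opening delimiter without a closing one


# Sample page taken from the README's "Output Format" section.
page = """---
url: https://stripe.com/docs/payments
fetched: 2025-11-13
---
# Payment Intents
..."""

meta, body = split_front_matter(page)
print(meta["url"])
print(body)
```

Splitting on the first colon only is the one design choice that matters here, since the `url` value itself contains a colon.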

{docpull-1.0.1 → docpull-1.0.2}/docpull/__init__.py RENAMED

@@ -1,4 +1,4 @@
-__version__ = "1.0.1"
+__version__ = "1.0.2"
 from .fetchers.base import BaseFetcher
 from .fetchers.bun import BunFetcher

docpull-1.0.2/docpull.egg-info/PKG-INFO ADDED

@@ -0,0 +1,215 @@
+Metadata-Version: 2.4
+Name: docpull
+Version: 1.0.2
+Summary: Pull documentation from the web and convert to clean markdown
+Author-email: Zachary Roth <support@raintree.technology>
+Maintainer-email: Raintree Technology <support@raintree.technology>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/raintree-technology/docpull
+Project-URL: Documentation, https://github.com/raintree-technology/docpull#readme
+Project-URL: Repository, https://github.com/raintree-technology/docpull
+Project-URL: Source Code, https://github.com/raintree-technology/docpull
+Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
+Project-URL: Changelog, https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md
+Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Education
+Classifier: Environment :: Console
+Classifier: Topic :: Documentation
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Software Development :: Documentation
+Classifier: Topic :: Text Processing :: Markup :: HTML
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Topic :: Utilities
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Natural Language :: English
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Typing :: Typed
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: requests>=2.31.0
+Requires-Dist: beautifulsoup4>=4.12.0
+Requires-Dist: html2text>=2020.1.16
+Requires-Dist: defusedxml>=0.7.1
+Requires-Dist: aiohttp>=3.9.0
+Requires-Dist: rich>=13.0.0
+Provides-Extra: yaml
+Requires-Dist: pyyaml>=6.0; extra == "yaml"
+Provides-Extra: js
+Requires-Dist: playwright>=1.40.0; extra == "js"
+Provides-Extra: all
+Requires-Dist: pyyaml>=6.0; extra == "all"
+Requires-Dist: playwright>=1.40.0; extra == "all"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
+Requires-Dist: black>=23.0.0; extra == "dev"
+Requires-Dist: mypy>=1.0.0; extra == "dev"
+Requires-Dist: ruff>=0.1.0; extra == "dev"
+Requires-Dist: bandit>=1.7.0; extra == "dev"
+Requires-Dist: pip-audit>=2.0.0; extra == "dev"
+Requires-Dist: types-requests>=2.31.0; extra == "dev"
+Requires-Dist: types-beautifulsoup4>=4.12.0; extra == "dev"
+Requires-Dist: types-aiohttp>=3.9.0; extra == "dev"
+Dynamic: license-file
+# docpull
+**Pulls documentation from any website and converts it into clean, AI-ready Markdown.**
+Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
+## Why docpull?
+Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.
+## Key Features
+- Works on any documentation site
+- Smart extraction of main content
+- Async + parallel fetching (up to 10× faster)
+- Optional JavaScript rendering via Playwright
+- Sitemap + link crawling
+- URL-based filtering (include/exclude)
+- Rate limiting, timeouts, content-type checks
+- Saves docs in structured Markdown with YAML metadata
+- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
+## Quick Start
+```bash
+pip install docpull
+docpull https://aptos.dev
+docpull stripe           # use a built-in profile
+docpull https://site.com/docs --max-pages 100 --max-concurrent 20
+```
+### JavaScript-heavy sites
+```bash
+pip install docpull[js]
+python -m playwright install chromium
+docpull https://site.com --js
+```
+## Python API
+```python
+from docpull import GenericAsyncFetcher
+fetcher = GenericAsyncFetcher(
+    url_or_profile="https://aptos.dev",
+    output_dir="./docs",
+    max_pages=100,
+    max_concurrent=20,
+)
+fetcher.fetch()
+```
+## Common Options
+- `--max-pages N` – limit crawl size
+- `--max-depth N` – restrict link depth
+- `--max-concurrent N` – control parallel fetches
+- `--js` – enable Playwright rendering
+- `--output-dir DIR`
+- `--rate-limit X`
+- `--no-skip-existing`
+- `--dry-run`
+## Performance
+Async fetching drastically reduces runtime:
+| Pages | Sync | Async | Speedup |
+|-------|------|-------|---------|
+| 50 | ~50s | ~6s | 8× faster |
+Higher concurrency yields even better results.
+## Output Format
+Each downloaded page becomes a Markdown file:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+---
+# Payment Intents
+...
+```
+Directory layout mirrors the target site's structure.
+## Configuration File (Optional)
+```yaml
+output_dir: ./docs
+rate_limit: 0.5
+sources:
+  - stripe
+  - nextjs
+```
+Run with:
+```bash
+docpull --config config.yaml
+```
+## Custom Profiles
+Easily define profiles for frequently scraped sites.
+```python
+from docpull.profiles.base import SiteProfile
+MY_PROFILE = SiteProfile(
+    name="mysite",
+    domains={"docs.mysite.com"},
+    include_patterns=["/docs/", "/api/"],
+)
+```
+## Security
+- HTTPS-only
+- Blocks private network IPs
+- 50MB page size limit
+- Timeout controls
+- Validates content-type
+- Playwright sandboxing
+## Troubleshooting
+- **Site requires JS**: install Playwright + `--js`
+- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
+- **Large sites**: set `--max-pages`
+## Links
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Issues](https://github.com/raintree-technology/docpull/issues)
+## License
+MIT License - see [LICENSE](LICENSE) file for details

{docpull-1.0.1 → docpull-1.0.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "docpull"
-version = "1.0.1"
+version = "1.0.2"
 description = "Pull documentation from the web and convert to clean markdown"
 readme = {file = "README.md", content-type = "text/markdown"}
 requires-python = ">=3.9"
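
The only code-level change in this release is the version string, which is bumped in both `docpull/__init__.py` and `pyproject.toml`. A small check like the following could confirm that the two stay in step. This is a sketch under stated assumptions: the file contents are inlined as strings copied from the hunks above, and `find_version` and `as_tuple` are hypothetical helpers, not part of docpull or its build tooling.

```python
import re

# Post-change lines copied from the two hunks above.
INIT_PY = '__version__ = "1.0.2"\n'
PYPROJECT = '[project]\nname = "docpull"\nversion = "1.0.2"\n'


def find_version(text, key):
    """Hypothetical helper: return the quoted value assigned to `key`, or None."""
    pattern = rf'^{re.escape(key)}\s*=\s*"([^"]+)"'
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


def as_tuple(version):
    """Turn a purely numeric dotted version such as '1.0.2' into (1, 0, 2)."""
    return tuple(int(part) for part in version.split("."))


init_version = find_version(INIT_PY, "__version__")
project_version = find_version(PYPROJECT, "version")

print(init_version == project_version)
print(as_tuple(project_version) > as_tuple("1.0.1"))
```

The tuple comparison only handles plain numeric versions; pre-release or local version suffixes would need a proper version parser.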
