PyPI - docpull - Versions diffs - 1.0.1__tar.gz → 1.1.0__tar.gz - Mend

docpull 1.0.1tar.gz → 1.1.0tar.gz


Files changed (50)

docpull-1.1.0/PKG-INFO ADDED

@@ -0,0 +1,221 @@
+Metadata-Version: 2.4
+Name: docpull
+Version: 1.1.0
+Summary: Pull documentation from the web and convert to clean markdown
+Author-email: Zachary Roth <support@raintree.technology>
+Maintainer-email: Raintree Technology <support@raintree.technology>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/raintree-technology/docpull
+Project-URL: Documentation, https://github.com/raintree-technology/docpull#readme
+Project-URL: Repository, https://github.com/raintree-technology/docpull
+Project-URL: Source Code, https://github.com/raintree-technology/docpull
+Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
+Project-URL: Changelog, https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md
+Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Education
+Classifier: Environment :: Console
+Classifier: Topic :: Documentation
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Software Development :: Documentation
+Classifier: Topic :: Text Processing :: Markup :: HTML
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Topic :: Utilities
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Natural Language :: English
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Typing :: Typed
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: requests>=2.31.0
+Requires-Dist: beautifulsoup4>=4.12.0
+Requires-Dist: html2text>=2020.1.16
+Requires-Dist: defusedxml>=0.7.1
+Requires-Dist: aiohttp>=3.9.0
+Requires-Dist: rich>=13.0.0
+Provides-Extra: yaml
+Requires-Dist: pyyaml>=6.0; extra == "yaml"
+Provides-Extra: js
+Requires-Dist: playwright>=1.40.0; extra == "js"
+Provides-Extra: all
+Requires-Dist: pyyaml>=6.0; extra == "all"
+Requires-Dist: playwright>=1.40.0; extra == "all"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
+Requires-Dist: black>=23.0.0; extra == "dev"
+Requires-Dist: mypy>=1.0.0; extra == "dev"
+Requires-Dist: ruff>=0.1.0; extra == "dev"
+Requires-Dist: bandit>=1.7.0; extra == "dev"
+Requires-Dist: pip-audit>=2.0.0; extra == "dev"
+Requires-Dist: types-requests>=2.31.0; extra == "dev"
+Requires-Dist: types-beautifulsoup4>=4.12.0; extra == "dev"
+Requires-Dist: types-defusedxml>=0.7.0; extra == "dev"
+Dynamic: license-file
+# docpull
+**Pulls documentation from any website and converts it into clean, AI-ready Markdown.**
+Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
+## Why docpull?
+Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.
+## Key Features
+- Works on any documentation site
+- Smart extraction of main content
+- Async + parallel fetching (up to 10× faster)
+- Optional JavaScript rendering via Playwright
+- Sitemap + link crawling
+- URL-based filtering (include/exclude)
+- Rate limiting, timeouts, content-type checks
+- Saves docs in structured Markdown with YAML metadata
+- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
+## Quick Start
+```bash
+pip install docpull
+docpull --doctor         # verify installation
+docpull https://aptos.dev
+docpull stripe           # use a built-in profile
+docpull https://site.com/docs --max-pages 100 --max-concurrent 20
+```
+### JavaScript-heavy sites
+```bash
+pip install docpull[js]
+python -m playwright install chromium
+docpull https://site.com --js
+```
+## Python API
+```python
+from docpull import GenericAsyncFetcher
+fetcher = GenericAsyncFetcher(
+    url_or_profile="https://aptos.dev",
+    output_dir="./docs",
+    max_pages=100,
+    max_concurrent=20,
+)
+fetcher.fetch()
+```
+## Common Options
+- `--doctor` – verify installation and dependencies
+- `--max-pages N` – limit crawl size
+- `--max-depth N` – restrict link depth
+- `--max-concurrent N` – control parallel fetches
+- `--js` – enable Playwright rendering
+- `--output-dir DIR`
+- `--rate-limit X`
+- `--no-skip-existing`
+- `--dry-run`
+## Performance
+Async fetching drastically reduces runtime:
+| Pages | Sync | Async | Speedup |
+|-------|------|-------|---------|
+| 50 | ~50s | ~6s | 8× faster |
+Higher concurrency yields even better results.
+## Output Format
+Each downloaded page becomes a Markdown file:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+---
+# Payment Intents
+...
+```
+Directory layout mirrors the target site's structure.
+## Configuration File (Optional)
+```yaml
+output_dir: ./docs
+rate_limit: 0.5
+sources:
+  - stripe
+  - nextjs
+```
+Run with:
+```bash
+docpull --config config.yaml
+```
+## Custom Profiles
+Easily define profiles for frequently scraped sites.
+```python
+from docpull.profiles.base import SiteProfile
+MY_PROFILE = SiteProfile(
+    name="mysite",
+    domains={"docs.mysite.com"},
+    include_patterns=["/docs/", "/api/"],
+)
+```
+## Security
+- HTTPS-only
+- Blocks private network IPs
+- 50MB page size limit
+- Timeout controls
+- Validates content-type
+- Playwright sandboxing
+## Troubleshooting
+- **Installation issues**: Run `docpull --doctor` to diagnose problems
+- **Missing dependencies**: See [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common fixes
+- **Site requires JS**: install Playwright + `--js`
+- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
+- **Large sites**: set `--max-pages`
+For detailed troubleshooting, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
+## Links
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Issues](https://github.com/raintree-technology/docpull/issues)
+## License
+MIT License - see [LICENSE](LICENSE) file for details

docpull-1.1.0/README.md ADDED

@@ -0,0 +1,154 @@
+# docpull
+**Pulls documentation from any website and converts it into clean, AI-ready Markdown.**
+Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
+## Why docpull?
+Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.
+## Key Features
+- Works on any documentation site
+- Smart extraction of main content
+- Async + parallel fetching (up to 10× faster)
+- Optional JavaScript rendering via Playwright
+- Sitemap + link crawling
+- URL-based filtering (include/exclude)
+- Rate limiting, timeouts, content-type checks
+- Saves docs in structured Markdown with YAML metadata
+- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
+## Quick Start
+```bash
+pip install docpull
+docpull --doctor         # verify installation
+docpull https://aptos.dev
+docpull stripe           # use a built-in profile
+docpull https://site.com/docs --max-pages 100 --max-concurrent 20
+```
+### JavaScript-heavy sites
+```bash
+pip install docpull[js]
+python -m playwright install chromium
+docpull https://site.com --js
+```
+## Python API
+```python
+from docpull import GenericAsyncFetcher
+fetcher = GenericAsyncFetcher(
+    url_or_profile="https://aptos.dev",
+    output_dir="./docs",
+    max_pages=100,
+    max_concurrent=20,
+)
+fetcher.fetch()
+```
+## Common Options
+- `--doctor` – verify installation and dependencies
+- `--max-pages N` – limit crawl size
+- `--max-depth N` – restrict link depth
+- `--max-concurrent N` – control parallel fetches
+- `--js` – enable Playwright rendering
+- `--output-dir DIR`
+- `--rate-limit X`
+- `--no-skip-existing`
+- `--dry-run`
+## Performance
+Async fetching drastically reduces runtime:
+| Pages | Sync | Async | Speedup |
+|-------|------|-------|---------|
+| 50 | ~50s | ~6s | 8× faster |
+Higher concurrency yields even better results.
+## Output Format
+Each downloaded page becomes a Markdown file:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+---
+# Payment Intents
+...
+```
+Directory layout mirrors the target site's structure.
+## Configuration File (Optional)
+```yaml
+output_dir: ./docs
+rate_limit: 0.5
+sources:
+  - stripe
+  - nextjs
+```
+Run with:
+```bash
+docpull --config config.yaml
+```
+## Custom Profiles
+Easily define profiles for frequently scraped sites.
+```python
+from docpull.profiles.base import SiteProfile
+MY_PROFILE = SiteProfile(
+    name="mysite",
+    domains={"docs.mysite.com"},
+    include_patterns=["/docs/", "/api/"],
+)
+```
+## Security
+- HTTPS-only
+- Blocks private network IPs
+- 50MB page size limit
+- Timeout controls
+- Validates content-type
+- Playwright sandboxing
+## Troubleshooting
+- **Installation issues**: Run `docpull --doctor` to diagnose problems
+- **Missing dependencies**: See [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common fixes
+- **Site requires JS**: install Playwright + `--js`
+- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
+- **Large sites**: set `--max-pages`
+For detailed troubleshooting, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
+## Links
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Issues](https://github.com/raintree-technology/docpull/issues)
+## License
+MIT License - see [LICENSE](LICENSE) file for details
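
Reviewer note on the README's Output Format section: each saved page starts with a small frontmatter block (`url`, `fetched`) followed by the Markdown body. The sketch below shows one way a downstream consumer might split the two without installing PyYAML. The helper name `split_frontmatter` is hypothetical and not part of docpull, and it only handles flat `key: value` lines like the ones in the README sample.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a page into (metadata, body).

    Hypothetical helper, not part of docpull. Handles only flat
    "key: value" lines between two "---" delimiter lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top

    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URLs keep their "https://".
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    return {}, text  # opening delimiter without a closing one


sample = (
    "---\n"
    "url: https://stripe.com/docs/payments\n"
    "fetched: 2025-11-13\n"
    "---\n"
    "# Payment Intents\n"
    "...\n"
)
meta, body = split_frontmatter(sample)
```

For nested or quoted YAML values a real YAML parser would be the safer choice; this sketch is only meant for the simple two-field layout shown in the README.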

{docpull-1.0.1 → docpull-1.1.0}/docpull/__init__.py RENAMED

@@ -1,4 +1,4 @@
-__version__ = "1.0.1"
+__version__ = "1.1.0"
 from .fetchers.base import BaseFetcher
 from .fetchers.bun import BunFetcher

{docpull-1.0.1 → docpull-1.1.0}/docpull/cli.py RENAMED

@@ -3,6 +3,40 @@ import sys
 from pathlib import Path
 from typing import Optional
+# Check if --doctor flag is present before checking dependencies
+# This allows users to diagnose issues even when dependencies are missing
+if "--doctor" in sys.argv:
+    from .doctor import run_doctor
+    # Parse output dir if provided
+    output_dir = None
+    if "--output-dir" in sys.argv or "-o" in sys.argv:
+        try:
+            flag_idx = sys.argv.index("--output-dir") if "--output-dir" in sys.argv else sys.argv.index("-o")
+            if flag_idx + 1 < len(sys.argv):
+                output_dir = Path(sys.argv[flag_idx + 1])
+        except (ValueError, IndexError):
+            pass
+    sys.exit(run_doctor(output_dir=output_dir))
+# Verify core dependencies are available
+try:
+    import aiohttp  # noqa: F401
+    import bs4  # noqa: F401
+    import defusedxml  # noqa: F401
+    import html2text  # noqa: F401
+    import requests  # noqa: F401
+    import rich  # noqa: F401
+except ImportError as e:
+    print(f"\nERROR: Missing required dependency: {e.name}", file=sys.stderr)
+    print("\nDocpull requires all core dependencies to be installed.", file=sys.stderr)
+    print("\nRecommended fixes:", file=sys.stderr)
+    print("  1. For pipx users: pipx reinstall docpull --force", file=sys.stderr)
+    print("  2. For pip users: pip install --upgrade --force-reinstall docpull", file=sys.stderr)
+    print("  3. For development: pip install -e .[dev]", file=sys.stderr)
+    print("\nTo diagnose issues, run: docpull --doctor", file=sys.stderr)
+    sys.exit(1)
 from . import __version__
 from .config import FetcherConfig
 from .fetchers import (
@@ -185,6 +219,12 @@ Examples:
         version=f"%(prog)s {__version__}",
     )
+    parser.add_argument(
+        "--doctor",
+        action="store_true",
+        help="Run diagnostic checks to verify installation",
+    )
     return parser
@@ -200,17 +240,31 @@ def generate_sample_config(output_path: Path) -> None:
     # Determine format from extension
     suffix = output_path.suffix.lower()
-    if suffix in [".yaml", ".yml"]:
-        config.save_yaml(output_path)
-        print(f"Sample YAML config generated: {output_path}")
-    elif suffix == ".json":
-        config.save_json(output_path)
-        print(f"Sample JSON config generated: {output_path}")
-    else:
-        print(f"Warning: Unknown extension {suffix}, generating YAML")
-        output_path = output_path.with_suffix(".yaml")
-        config.save_yaml(output_path)
-        print(f"Sample YAML config generated: {output_path}")
+    try:
+        if suffix in [".yaml", ".yml"]:
+            config.save_yaml(output_path)
+            print(f"Sample YAML config generated: {output_path}")
+        elif suffix == ".json":
+            config.save_json(output_path)
+            print(f"Sample JSON config generated: {output_path}")
+        else:
+            # Try YAML first, fall back to JSON if PyYAML not available
+            try:
+                print(f"Warning: Unknown extension {suffix}, generating YAML")
+                output_path = output_path.with_suffix(".yaml")
+                config.save_yaml(output_path)
+                print(f"Sample YAML config generated: {output_path}")
+            except ImportError:
+                print("PyYAML not installed, generating JSON instead")
+                output_path = output_path.with_suffix(".json")
+                config.save_json(output_path)
+                print(f"Sample JSON config generated: {output_path}")
+    except ImportError:
+        print("\nERROR: PyYAML is required for YAML config files")
+        print("Install it with: pip install docpull[yaml]")
+        print("\nAlternatively, use JSON format:")
+        print(f"  docpull --generate-config {output_path.with_suffix('.json')}")
+        raise
 def get_config(args: argparse.Namespace) -> FetcherConfig:
@@ -224,7 +278,17 @@ def get_config(args: argparse.Namespace) -> FetcherConfig:
         FetcherConfig instance
     """
     # Load from config file if provided
-    config = FetcherConfig.from_file(args.config) if args.config else FetcherConfig()
+    if args.config:
+        try:
+            config = FetcherConfig.from_file(args.config)
+        except ImportError as e:
+            print(f"\nERROR: Error loading config file: {e}")
+            if "yaml" in str(e).lower() or "pyyaml" in str(e).lower():
+                print("Install PyYAML with: pip install docpull[yaml]")
+                print("\nAlternatively, convert your config to JSON format")
+            raise
+    else:
+        config = FetcherConfig()
     # Override with command-line arguments
     if args.output_dir is not None:
@@ -411,6 +475,13 @@ def main(argv: Optional[list[str]] = None) -> int:
     parser = create_parser()
     args = parser.parse_args(argv)
+    # Handle --doctor
+    if args.doctor:
+        from .doctor import run_doctor
+        output_dir = Path(args.output_dir) if args.output_dir else None
+        return run_doctor(output_dir=output_dir)
     # Handle --generate-config
     if args.generate_config:
         try:
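
Reviewer note on the early `--doctor` handling in cli.py: the new module-level block scans `sys.argv` by hand, before argparse or any third-party import runs, so it only recognises the space-separated `--output-dir DIR` and `-o DIR` forms. The sketch below shows how such a pre-argparse scan could also accept `--output-dir=DIR`. The helpers `find_flag_value` and `early_output_dir` are hypothetical names used for illustration and do not appear in the package.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_flag_value(argv: Sequence[str], names: Sequence[str]) -> Optional[str]:
    """Return the value of the first matching flag, or None if absent.

    Hypothetical helper, not part of docpull.
    """
    for i, arg in enumerate(argv):
        for name in names:
            if arg == name:
                # "--flag value" form: the value is the next token, if any.
                return argv[i + 1] if i + 1 < len(argv) else None
            if name.startswith("--") and arg.startswith(name + "="):
                # "--flag=value" form.
                return arg[len(name) + 1:]
    return None


def early_output_dir(argv: Sequence[str]) -> Optional[Path]:
    """Resolve the output directory without importing argparse."""
    value = find_flag_value(argv, ["--output-dir", "-o"])
    return Path(value) if value else None
```

Called as `early_output_dir(sys.argv)`, this returns `None` when the flag is missing or has no value, which matches the fallback the diff passes to `run_doctor`.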

docpull-1.1.0/docpull/doctor.py ADDED

@@ -0,0 +1,188 @@
+"""Diagnostic tool for verifying docpull installation and dependencies."""
+import sys
+from importlib import import_module
+from pathlib import Path
+from typing import Optional
+try:
+    from rich.console import Console
+    from rich.table import Table
+    RICH_AVAILABLE = True
+except ImportError:
+    RICH_AVAILABLE = False
+    Console = None  # type: ignore
+    Table = None  # type: ignore
+def check_dependency(
+    module_name: str, package_name: Optional[str] = None, optional: bool = False
+) -> tuple[bool, str]:
+    """
+    Check if a Python module is importable.
+    Args:
+        module_name: Name of the module to import
+        package_name: Display name of the package (defaults to module_name)
+        optional: Whether this is an optional dependency
+    Returns:
+        Tuple of (success: bool, message: str)
+    """
+    display_name = package_name or module_name
+    try:
+        import_module(module_name)
+        return True, f"[OK] {display_name}"
+    except ImportError:
+        if optional:
+            return False, f"[WARN] {display_name} (optional - not installed)"
+        else:
+            return False, f"[MISSING] {display_name}"
+def check_network() -> tuple[bool, str]:
+    """
+    Check basic network connectivity.
+    Returns:
+        Tuple of (success: bool, message: str)
+    """
+    try:
+        import socket
+        # Try to resolve a common DNS name
+        socket.gethostbyname("www.google.com")
+        return True, "[OK] Network connectivity"
+    except socket.gaierror:
+        return False, "[FAIL] Network connectivity - DNS resolution failed"
+    except Exception as e:
+        return False, f"[WARN] Network connectivity - {str(e)}"
+def check_output_dir(output_dir: Optional[Path] = None) -> tuple[bool, str]:
+    """
+    Check if output directory is writable.
+    Args:
+        output_dir: Directory to check (defaults to ./docs)
+    Returns:
+        Tuple of (success: bool, message: str)
+    """
+    test_dir = output_dir or Path("./docs")
+    try:
+        # Create directory if it doesn't exist
+        test_dir.mkdir(parents=True, exist_ok=True)
+        # Try to write a test file
+        test_file = test_dir / ".docpull_test"
+        test_file.write_text("test")
+        test_file.unlink()
+        return True, f"[OK] Output directory writable ({test_dir})"
+    except PermissionError:
+        return False, f"[FAIL] Output directory - permission denied ({test_dir})"
+    except Exception as e:
+        return False, f"[FAIL] Output directory - {str(e)} ({test_dir})"
+def run_doctor(output_dir: Optional[Path] = None, use_rich: bool = True) -> int:
+    """
+    Run diagnostic checks and display results.
+    Args:
+        output_dir: Output directory to check for writability
+        use_rich: Whether to use rich formatting (if available)
+    Returns:
+        Exit code (0 if all core dependencies OK, 1 if any core dependency missing)
+    """
+    # Determine if we can use rich formatting
+    use_rich = use_rich and RICH_AVAILABLE
+    print("Running docpull diagnostics...\n")
+    # Core dependencies
+    core_checks = [
+        ("requests", "requests"),
+        ("bs4", "beautifulsoup4"),
+        ("html2text", "html2text"),
+        ("defusedxml", "defusedxml"),
+        ("aiohttp", "aiohttp"),
+        ("rich", "rich"),
+    ]
+    # Optional dependencies
+    optional_checks = [
+        ("yaml", "pyyaml", True),
+        ("playwright.async_api", "playwright", True),
+    ]
+    # Other checks
+    system_checks = [
+        check_network(),
+        check_output_dir(output_dir),
+    ]
+    # Run core dependency checks
+    core_results = [check_dependency(mod, pkg) for mod, pkg in core_checks]
+    optional_results = [check_dependency(mod, pkg, opt) for mod, pkg, opt in optional_checks]
+    all_checks = {
+        "Core Dependencies": core_results,
+        "Optional Dependencies": optional_results,
+        "System": system_checks,
+    }
+    # Display results
+    if use_rich:
+        console = Console()
+        for category, results in all_checks.items():
+            table = Table(title=category, show_header=False, box=None)
+            table.add_column("Status", style="bold")
+            for success, message in results:
+                style = "green" if success else ("yellow" if "optional" in message else "red")
+                table.add_row(message, style=style)
+            console.print(table)
+            console.print()
+    else:
+        # Fallback to plain text
+        for category, results in all_checks.items():
+            print(f"{category}:")
+            for _success, message in results:
+                print(f"  {message}")
+            print()
+    # Check if any core dependencies failed
+    core_failed = any(not success for success, _ in core_results)
+    # Print summary
+    if core_failed:
+        print("\nWARNING: Some core dependencies are missing!")
+        print("\nRecommended fixes:")
+        print("  1. For pipx users: pipx reinstall docpull --force")
+        print("  2. For pip users: pip install --upgrade --force-reinstall docpull")
+        print("  3. For development: pip install -e .[dev]")
+        return 1
+    else:
+        print("\nAll core dependencies installed correctly!")
+        # Check if optional dependencies are missing
+        optional_missing = [msg for success, msg in optional_results if not success]
+        if optional_missing:
+            print("\nOptional features available:")
+            print("  - YAML config support: pip install docpull[yaml]")
+            print("  - JavaScript rendering: pip install docpull[js]")
+            print("  - All optional features: pip install docpull[all]")
+        return 0
+if __name__ == "__main__":
+    sys.exit(run_doctor())
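
Reviewer note on the exit-code rule in doctor.py: `run_doctor` returns 1 only when a core dependency fails to import, while missing optional packages still yield 0. The standalone sketch below restates that rule using the same `check_dependency` logic as the diff. The wrapper `doctor_exit_code` and the module name `no_such_module_docpull_demo` are made up for the demonstration, and the stdlib `json` module stands in for a real dependency.

```python
from importlib import import_module
from typing import Optional


def check_dependency(
    module_name: str, package_name: Optional[str] = None, optional: bool = False
) -> tuple[bool, str]:
    """Same logic as the function added in doctor.py."""
    display_name = package_name or module_name
    try:
        import_module(module_name)
        return True, f"[OK] {display_name}"
    except ImportError:
        if optional:
            return False, f"[WARN] {display_name} (optional - not installed)"
        return False, f"[MISSING] {display_name}"


def doctor_exit_code(core: list[str], optional: list[str]) -> int:
    """Hypothetical wrapper: 1 if any core module is missing, else 0."""
    core_results = [check_dependency(mod) for mod in core]
    # Optional results are reported to the user but never affect the exit code.
    _optional_results = [check_dependency(mod, optional=True) for mod in optional]
    return 1 if any(not ok for ok, _ in core_results) else 0
```

One consequence worth noting: the network and output-directory checks in `run_doctor` are displayed but, like the optional checks here, do not influence the returned exit code.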
