PyPI - docpull - Versions diffs - 1.0.1__tar.gz - Mend

docpull 1.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (46) hide show

docpull-1.0.1/LICENSE +21 -0
docpull-1.0.1/PKG-INFO +440 -0
docpull-1.0.1/README.md +373 -0
docpull-1.0.1/docpull/__init__.py +29 -0
docpull-1.0.1/docpull/__main__.py +6 -0
docpull-1.0.1/docpull/cli.py +440 -0
docpull-1.0.1/docpull/config.py +199 -0
docpull-1.0.1/docpull/fetchers/__init__.py +23 -0
docpull-1.0.1/docpull/fetchers/async_fetcher.py +322 -0
docpull-1.0.1/docpull/fetchers/base.py +450 -0
docpull-1.0.1/docpull/fetchers/bun.py +59 -0
docpull-1.0.1/docpull/fetchers/d3.py +211 -0
docpull-1.0.1/docpull/fetchers/generic.py +255 -0
docpull-1.0.1/docpull/fetchers/generic_async.py +282 -0
docpull-1.0.1/docpull/fetchers/nextjs.py +50 -0
docpull-1.0.1/docpull/fetchers/parallel_base.py +93 -0
docpull-1.0.1/docpull/fetchers/plaid.py +92 -0
docpull-1.0.1/docpull/fetchers/react.py +59 -0
docpull-1.0.1/docpull/fetchers/stripe.py +60 -0
docpull-1.0.1/docpull/fetchers/tailwind.py +59 -0
docpull-1.0.1/docpull/fetchers/turborepo.py +57 -0
docpull-1.0.1/docpull/profiles/__init__.py +70 -0
docpull-1.0.1/docpull/profiles/base.py +64 -0
docpull-1.0.1/docpull/profiles/bun.py +14 -0
docpull-1.0.1/docpull/profiles/d3.py +17 -0
docpull-1.0.1/docpull/profiles/nextjs.py +15 -0
docpull-1.0.1/docpull/profiles/plaid.py +16 -0
docpull-1.0.1/docpull/profiles/react.py +14 -0
docpull-1.0.1/docpull/profiles/stripe.py +14 -0
docpull-1.0.1/docpull/profiles/tailwind.py +14 -0
docpull-1.0.1/docpull/profiles/turborepo.py +14 -0
docpull-1.0.1/docpull/py.typed +0 -0
docpull-1.0.1/docpull/utils/__init__.py +6 -0
docpull-1.0.1/docpull/utils/file_utils.py +97 -0
docpull-1.0.1/docpull/utils/logging_config.py +54 -0
docpull-1.0.1/docpull.egg-info/PKG-INFO +440 -0
docpull-1.0.1/docpull.egg-info/SOURCES.txt +44 -0
docpull-1.0.1/docpull.egg-info/dependency_links.txt +1 -0
docpull-1.0.1/docpull.egg-info/entry_points.txt +2 -0
docpull-1.0.1/docpull.egg-info/requires.txt +29 -0
docpull-1.0.1/docpull.egg-info/top_level.txt +1 -0
docpull-1.0.1/pyproject.toml +162 -0
docpull-1.0.1/setup.cfg +4 -0
docpull-1.0.1/tests/test_async_fetcher.py +147 -0
docpull-1.0.1/tests/test_config.py +46 -0
docpull-1.0.1/tests/test_fetchers.py +57 -0

docpull-1.0.1/LICENSE ADDED Viewed

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Raintree Technology
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

docpull-1.0.1/PKG-INFO ADDED Viewed

@@ -0,0 +1,440 @@
+Metadata-Version: 2.4
+Name: docpull
+Version: 1.0.1
+Summary: Pull documentation from the web and convert to clean markdown
+Author-email: Zachary Roth <support@raintree.technology>
+Maintainer-email: Raintree Technology <support@raintree.technology>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/raintree-technology/docpull
+Project-URL: Documentation, https://github.com/raintree-technology/docpull#readme
+Project-URL: Repository, https://github.com/raintree-technology/docpull
+Project-URL: Source Code, https://github.com/raintree-technology/docpull
+Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
+Project-URL: Changelog, https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md
+Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Education
+Classifier: Environment :: Console
+Classifier: Topic :: Documentation
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Software Development :: Documentation
+Classifier: Topic :: Text Processing :: Markup :: HTML
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Topic :: Utilities
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Natural Language :: English
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Typing :: Typed
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: requests>=2.31.0
+Requires-Dist: beautifulsoup4>=4.12.0
+Requires-Dist: html2text>=2020.1.16
+Requires-Dist: defusedxml>=0.7.1
+Requires-Dist: aiohttp>=3.9.0
+Requires-Dist: rich>=13.0.0
+Provides-Extra: yaml
+Requires-Dist: pyyaml>=6.0; extra == "yaml"
+Provides-Extra: js
+Requires-Dist: playwright>=1.40.0; extra == "js"
+Provides-Extra: all
+Requires-Dist: pyyaml>=6.0; extra == "all"
+Requires-Dist: playwright>=1.40.0; extra == "all"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
+Requires-Dist: black>=23.0.0; extra == "dev"
+Requires-Dist: mypy>=1.0.0; extra == "dev"
+Requires-Dist: ruff>=0.1.0; extra == "dev"
+Requires-Dist: bandit>=1.7.0; extra == "dev"
+Requires-Dist: pip-audit>=2.0.0; extra == "dev"
+Requires-Dist: types-requests>=2.31.0; extra == "dev"
+Requires-Dist: types-beautifulsoup4>=4.12.0; extra == "dev"
+Requires-Dist: types-aiohttp>=3.9.0; extra == "dev"
+Dynamic: license-file
+# docpull
+**Pull documentation from ANY website and convert to clean, AI-ready markdown.**
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
+[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
+## Why docpull?
+Unlike wget or httrack that dump messy HTML, **docpull extracts clean markdown** perfect for:
+- Training AI models / RAG systems
+- Building knowledge bases
+- Creating searchable documentation archives
+- Offline documentation reading
+**Production-ready**: Full type safety (mypy), security scanning (Bandit), zero linting issues (Ruff), comprehensive test coverage, and no known vulnerabilities.
+## Features
+- **Universal**: Scrape ANY documentation site - not limited to predefined sources
+- **Smart extraction**: Auto-detects main content, removes navigation/ads
+- **Blazing fast**: Async/parallel fetching (10x faster than sync)
+- **JavaScript support**: Handles JS-heavy sites with Playwright
+- **Progress bars**: Beautiful real-time progress with Rich
+- **Sitemap support**: Auto-discovers pages via sitemap.xml
+- **Link crawling**: Optionally follows links to discover all pages
+- **Secure**: Rate limiting, content validation, timeout controls
+- **Clean output**: Markdown with YAML frontmatter
+- **Configurable**: Control depth, page limits, concurrency
+- **Resumable**: Skip already-fetched files
+## Quick Start
+```bash
+# Install
+pip install docpull
+# Scrape ANY documentation site
+docpull https://aptos.dev
+docpull https://docs.anthropic.com
+docpull https://go.dev/doc
+# Use optimized profiles for popular sites
+docpull stripe
+docpull nextjs react
+# Control scraping behavior
+docpull https://newsite.com/docs --max-pages 100 --max-concurrent 20
+```
+## Installation
+```bash
+# Basic installation
+pip install docpull
+# With YAML config support
+pip install docpull[yaml]
+# With JavaScript rendering (for JS-heavy sites)
+pip install docpull[js]
+python -m playwright install chromium
+# Everything
+pip install docpull[all]
+python -m playwright install chromium
+```
+## Usage
+### Scrape Any URL
+The primary way to use docpull is by providing any documentation URL:
+```bash
+# Single site
+docpull https://aptos.dev
+# Multiple sites
+docpull https://aptos.dev https://docs.soliditylang.org
+# Control crawling
+docpull https://docs.example.com \
+  --max-pages 200 \
+  --max-depth 4 \
+  --rate-limit 1.0
+```
+### Use Optimized Profiles
+For popular documentation sites, use shortcut names for optimized scraping:
+```bash
+# Single profile
+docpull stripe
+# Multiple profiles
+docpull stripe plaid nextjs
+# Mix profiles and URLs
+docpull stripe https://newsite.com/docs
+```
+### JavaScript Rendering
+For sites that require JavaScript to render content:
+```bash
+# Enable JS rendering with Playwright
+docpull https://js-heavy-site.com --js
+# Combine with other options
+docpull https://site.com --js --max-pages 50 --max-concurrent 5
+```
+**Note:** JS rendering is slower but handles modern SPAs and dynamically-loaded content.
+### Available Profiles
+| Profile | Site | Optimizations |
+|---------|------|---------------|
+| `stripe` | docs.stripe.com | Filters changelog, focused on API docs |
+| `nextjs` | nextjs.org | Excludes blog/showcase, docs only |
+| `react` | react.dev | Learn & reference sections only |
+| `plaid` | plaid.com | API + guides, excludes marketing |
+| `tailwind` | tailwindcss.com | Documentation only |
+| `bun` | bun.sh | Runtime documentation |
+| `d3` | d3js.org | Data visualization docs |
+| `turborepo` | turbo.build | Monorepo tooling docs |
+### Python API
+```python
+from docpull import GenericAsyncFetcher
+# Scrape any URL (async/parallel)
+fetcher = GenericAsyncFetcher(
+    url_or_profile="https://aptos.dev",
+    output_dir="./docs",
+    max_pages=100,
+    max_concurrent=20,
+    use_js=False,  # Set to True for JS rendering
+)
+fetcher.fetch()
+# Or use a profile
+fetcher = GenericAsyncFetcher(
+    url_or_profile="stripe",
+    output_dir="./docs",
+)
+fetcher.fetch()
+```
+### Advanced Options
+```bash
+# Limit pages and depth
+docpull https://docs.example.com --max-pages 50 --max-depth 2
+# Control concurrent requests (default: 10)
+docpull https://site.com --max-concurrent 20
+# Enable JavaScript rendering
+docpull https://site.com --js
+# Custom output directory
+docpull stripe --output-dir ./my-docs
+# Adjust rate limiting
+docpull https://site.com --rate-limit 2.0
+# Re-fetch existing files
+docpull stripe --no-skip-existing
+# Verbose logging
+docpull https://site.com --verbose
+# Disable progress bars
+docpull https://site.com --no-progress
+# Dry run (see what would be fetched)
+docpull https://site.com --dry-run
+```
+## Performance
+**Async/Parallel Fetching** makes docpull **10x faster** than traditional sync scrapers:
+| Pages | Sync (old) | Async (new) | Speedup |
+|-------|-----------|-------------|---------|
+| 5 | ~5.0s | ~1.8s | 2.8x faster |
+| 50 | ~50s | ~6s | 8.3x faster |
+| 500 | ~500s | ~45s | 11x faster |
+With `--max-concurrent 20`, even faster for large sites!
+## Output Format
+Each page is saved as markdown with YAML frontmatter:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+---
+# Payment Intents
+Your clean documentation content here...
+```
+Files are organized by URL structure:
+```
+docs/
+├── stripe/
+│   ├── api/
+│   │   ├── charges.md
+│   │   └── customers.md
+│   └── payments/
+│       └── payment-intents.md
+└── aptos_dev/
+    ├── guides/
+    │   └── getting-started.md
+    └── reference/
+        └── api.md
+```
+## How It Works
+1. **Discovery**: Tries sitemap.xml first, falls back to link crawling
+2. **Filtering**: Applies URL patterns to focus on documentation
+3. **Extraction**: Removes nav/footer/ads, extracts main content
+4. **Conversion**: Converts HTML to clean markdown
+5. **Organization**: Saves with structure that mirrors the site
+6. **Async Magic**: Fetches multiple pages concurrently with rate limiting
+## Configuration File
+Create `config.yaml` for complex setups:
+```yaml
+output_dir: ./docs
+rate_limit: 0.5
+skip_existing: true
+log_level: INFO
+sources:
+  - stripe
+  - nextjs
+  - react
+```
+Run with:
+```bash
+docpull --config config.yaml
+```
+## Creating Custom Profiles
+You can create optimized profiles for your frequently-scraped sites:
+```python
+from docpull.profiles.base import SiteProfile
+MY_PROFILE = SiteProfile(
+    name="mysite",
+    domains={"docs.mysite.com"},
+    sitemap_url="https://docs.mysite.com/sitemap.xml",
+    base_url="https://docs.mysite.com/",
+    include_patterns=["/docs/", "/api/"],
+    exclude_patterns=["/blog/"],
+    output_subdir="mysite",
+    rate_limit=0.5,
+)
+```
+## Security
+docpull is designed with security in mind:
+- **HTTPS-only** by default
+- **Private IP blocking** (no localhost, 192.168.x.x, etc.)
+- **Content size limits** (50MB max per page)
+- **Timeout controls** (30s per request)
+- **Rate limiting** (async-safe, prevents DoS)
+- **Concurrent connection limits** (prevents overwhelming servers)
+- **Content-type validation** (only fetches HTML/XML)
+- **Playwright sandboxing** (when using --js)
+See [SECURITY.md](SECURITY.md) for detailed security information.
+## Comparison with Alternatives
+| Tool | Output | Works on any site? | Clean extraction? | Speed | JS Support |
+|------|--------|-------------------|-------------------|-------|------------|
+| **docpull** | Clean markdown | Yes | Yes | Fast (async) | Optional |
+| wget | Raw HTML | Yes | No | Slow (sync) | No |
+| httrack | Raw HTML | Yes | No | Slow (sync) | No |
+| Site-specific | Varies | No | Varies | Varies | No |
+## Troubleshooting
+### Site requires JavaScript
+```bash
+# Install Playwright support
+pip install docpull[js]
+python -m playwright install chromium
+# Use --js flag
+docpull https://site.com --js
+```
+### Too slow / rate limited
+```bash
+# Reduce concurrent requests
+docpull https://site.com --max-concurrent 5 --rate-limit 2.0
+```
+### Memory issues on large sites
+```bash
+# Limit pages fetched
+docpull https://site.com --max-pages 1000
+```
+## Contributing
+Contributions welcome! To add:
+- **New site profiles**: Create a profile in `docpull/profiles/`
+- **Better extraction**: Improve content detection in `fetchers/base.py`
+- **Performance improvements**: Optimize async fetching
+- **Bug reports**: Use the [issue tracker](https://github.com/raintree-technology/docpull/issues)
+### Development Setup
+```bash
+# Clone and install
+git clone https://github.com/raintree-technology/docpull
+cd docpull
+pip install -e ".[dev]"
+# Run all quality checks (as per CI)
+black --check .           # Code formatting
+ruff check .              # Linting
+mypy docpull              # Type checking
+bandit -r docpull         # Security scanning
+pip-audit                 # Dependency vulnerabilities
+pytest --cov=docpull -v   # Tests with coverage
+```
+All PRs must pass these checks before merging.
+## Documentation
+- [Changelog](CHANGELOG.md)
+- [Security Policy](SECURITY.md)
+## License
+MIT License - see [LICENSE](LICENSE) file for details
+## Links
+- [PyPI](https://pypi.org/project/docpull/)
+- [GitHub](https://github.com/raintree-technology/docpull)
+- [Issues](https://github.com/raintree-technology/docpull/issues)