PyPI - docpull - Versions diffs - 1.2.1__tar.gz → 1.5.0__tar.gz - Mend

docpull 1.2.1.tar.gz → 1.5.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (89)

{docpull-1.2.1 → docpull-1.5.0}/PKG-INFO +137 -54
{docpull-1.2.1 → docpull-1.5.0}/README.md +128 -52
docpull-1.5.0/docpull/__init__.py +13 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/cli.py +78 -140
{docpull-1.2.1 → docpull-1.5.0}/docpull/config.py +32 -11
docpull-1.5.0/docpull/fetchers/__init__.py +11 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/fetchers/async_fetcher.py +172 -31
{docpull-1.2.1 → docpull-1.5.0}/docpull/fetchers/base.py +246 -9
{docpull-1.2.1 → docpull-1.5.0}/docpull/fetchers/generic.py +25 -65
{docpull-1.2.1 → docpull-1.5.0}/docpull/fetchers/generic_async.py +105 -63
docpull-1.5.0/docpull/metadata_extractor.py +283 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/sources_config.py +1 -0
docpull-1.5.0/docpull.egg-info/PKG-INFO +478 -0
docpull-1.5.0/docpull.egg-info/SOURCES.txt +49 -0
docpull-1.5.0/docpull.egg-info/dependency_links.txt +1 -0
docpull-1.5.0/docpull.egg-info/entry_points.txt +2 -0
docpull-1.5.0/docpull.egg-info/requires.txt +38 -0
docpull-1.5.0/docpull.egg-info/top_level.txt +1 -0
{docpull-1.2.1 → docpull-1.5.0}/pyproject.toml +15 -2
{docpull-1.2.1 → docpull-1.5.0}/tests/test_config.py +1 -5
docpull-1.5.0/tests/test_metadata_extractor.py +233 -0
docpull-1.2.1/.editorconfig +0 -30
docpull-1.2.1/.pre-commit-config.yaml +0 -30
docpull-1.2.1/CHANGELOG.md +0 -328
docpull-1.2.1/CONTRIBUTING.md +0 -189
docpull-1.2.1/MANIFEST.in +0 -49
docpull-1.2.1/Makefile +0 -44
docpull-1.2.1/SECURITY.md +0 -206
docpull-1.2.1/TROUBLESHOOTING.md +0 -348
docpull-1.2.1/docpull/__init__.py +0 -29
docpull-1.2.1/docpull/fetchers/__init__.py +0 -23
docpull-1.2.1/docpull/fetchers/bun.py +0 -59
docpull-1.2.1/docpull/fetchers/d3.py +0 -211
docpull-1.2.1/docpull/fetchers/nextjs.py +0 -59
docpull-1.2.1/docpull/fetchers/plaid.py +0 -89
docpull-1.2.1/docpull/fetchers/react.py +0 -59
docpull-1.2.1/docpull/fetchers/stripe.py +0 -49
docpull-1.2.1/docpull/fetchers/tailwind.py +0 -59
docpull-1.2.1/docpull/fetchers/turborepo.py +0 -57
docpull-1.2.1/docpull/profiles/__init__.py +0 -70
docpull-1.2.1/docpull/profiles/base.py +0 -64
docpull-1.2.1/docpull/profiles/bun.py +0 -14
docpull-1.2.1/docpull/profiles/d3.py +0 -17
docpull-1.2.1/docpull/profiles/nextjs.py +0 -15
docpull-1.2.1/docpull/profiles/plaid.py +0 -16
docpull-1.2.1/docpull/profiles/react.py +0 -14
docpull-1.2.1/docpull/profiles/stripe.py +0 -14
docpull-1.2.1/docpull/profiles/tailwind.py +0 -14
docpull-1.2.1/docpull/profiles/turborepo.py +0 -14
docpull-1.2.1/docpull/utils/__init__.py +0 -6
docpull-1.2.1/docpull.egg-info/SOURCES.txt +0 -76
docpull-1.2.1/examples/README.md +0 -280
docpull-1.2.1/examples/deduplication-strategies.yaml +0 -29
docpull-1.2.1/examples/format-conversion.yaml +0 -25
docpull-1.2.1/examples/incremental-updates.yaml +0 -26
docpull-1.2.1/examples/multi-source-optimized.yaml +0 -45
docpull-1.2.1/examples/selective-crawling.yaml +0 -26
docpull-1.2.1/examples/simple-optimization.yaml +0 -14
docpull-1.2.1/requirements.txt +0 -34
{docpull-1.2.1 → docpull-1.5.0}/LICENSE +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/__main__.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/archive.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/cache.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/doctor.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/fetchers/parallel_base.py +0 -0
{docpull-1.2.1/docpull/utils → docpull-1.5.0/docpull}/file_utils.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/formatters/__init__.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/formatters/base.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/formatters/json.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/formatters/markdown.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/formatters/sqlite.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/formatters/toon.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/hooks.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/indexer.py +0 -0
{docpull-1.2.1/docpull/utils → docpull-1.5.0/docpull}/logging_config.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/metadata.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/naming.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/orchestrator.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/processors/__init__.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/processors/base.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/processors/content_filter.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/processors/deduplicator.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/processors/language_filter.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/processors/size_limiter.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/py.typed +0 -0
{docpull-1.2.1 → docpull-1.5.0}/docpull/vcs.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/setup.cfg +0 -0
{docpull-1.2.1 → docpull-1.5.0}/tests/test_orchestrator.py +0 -0
{docpull-1.2.1 → docpull-1.5.0}/tests/test_sources_config.py +0 -0

{docpull-1.2.1 → docpull-1.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpull
-Version: 1.2.1
+Version: 1.5.0
 Summary: Pull documentation from the web and convert to clean markdown
 Author-email: Zachary Roth <support@raintree.technology>
 Maintainer-email: Raintree Technology <support@raintree.technology>
@@ -10,7 +10,7 @@ Project-URL: Documentation, https://github.com/raintree-technology/docpull#readm
 Project-URL: Repository, https://github.com/raintree-technology/docpull
 Project-URL: Source Code, https://github.com/raintree-technology/docpull
 Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
-Project-URL: Changelog, https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md
+Project-URL: Releases, https://github.com/raintree-technology/docpull/releases
 Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
 Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Developers
@@ -43,14 +43,21 @@ Requires-Dist: requests>=2.31.0
 Requires-Dist: beautifulsoup4>=4.12.0
 Requires-Dist: html2text>=2020.1.16
 Requires-Dist: defusedxml>=0.7.1
+Requires-Dist: extruct>=0.15.0
 Requires-Dist: aiohttp>=3.9.0
 Requires-Dist: rich>=13.0.0
 Requires-Dist: pyyaml>=6.0
 Requires-Dist: gitpython>=3.1.40
 Provides-Extra: js
 Requires-Dist: playwright>=1.40.0; extra == "js"
+Provides-Extra: proxy
+Requires-Dist: aiohttp-socks>=0.8.0; extra == "proxy"
+Provides-Extra: normalize
+Requires-Dist: url-normalize>=1.4.0; extra == "normalize"
 Provides-Extra: all
 Requires-Dist: playwright>=1.40.0; extra == "all"
+Requires-Dist: aiohttp-socks>=0.8.0; extra == "all"
+Requires-Dist: url-normalize>=1.4.0; extra == "all"
 Provides-Extra: dev
 Requires-Dist: pytest>=7.0.0; extra == "dev"
 Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
@@ -72,7 +79,11 @@ Dynamic: license-file
 **Pull documentation from any website and converts it into clean, AI-ready Markdown.**
 Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
-**NEW in v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
+**NEW in v1.5.0**: Proxy support, retry with exponential backoff, custom User-Agent, and mandatory robots.txt compliance for TOS-friendly scraping.
+**v1.3.0**: Rich structured metadata extraction (Open Graph, JSON-LD) for enhanced AI/RAG integration.
+**v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
 [![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
@@ -95,9 +106,22 @@ Unlike tools like wget or httrack, docpull extracts only the main content, remov
 - Sitemap + link crawling
 - Rate limiting, timeouts, content-type checks
 - Saves docs in structured Markdown with YAML metadata
-- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
-### NEW in v1.2.0: Advanced Optimization
+- **Mandatory robots.txt compliance** for TOS-friendly scraping
+### NEW in v1.5.0: Network & Reliability
+- **Proxy Support**: HTTP, HTTPS, and SOCKS5 proxies via `--proxy` or env vars
+- **Retry with Exponential Backoff**: Configurable retries for transient failures
+- **Custom User-Agent**: Set custom User-Agent strings for requests
+- **Crawl-delay Compliance**: Automatically respects robots.txt Crawl-delay directives
+- **Better Encoding Detection**: Intelligent charset detection for international docs
+### v1.3.0: Rich Metadata Extraction
+- **Structured Metadata**: Extract Open Graph, JSON-LD, and microdata during fetch
+- **Enhanced Frontmatter**: Adds author, description, keywords, images, publish dates, and more
+- **AI/RAG Ready**: Richer context for embeddings and retrieval systems
+- **Opt-in Feature**: Enabled with `--rich-metadata` flag
+### v1.2.0: Advanced Optimization
 - **Language Filtering**: Auto-detect and filter by language (skip 352+ translation files)
 - **Deduplication**: Remove duplicates with SHA-256 hashing (save 10+ MB on duplicate content)
 - **Auto-Index Generation**: Create navigable INDEX.md with tree/TOC/categories/stats
@@ -124,11 +148,14 @@ docpull --doctor         # verify installation
 # Basic usage
 docpull https://aptos.dev
-docpull stripe           # use a built-in profile
+docpull https://docs.anthropic.com
 # NEW: Simple optimization (v1.2.0)
 docpull https://code.claude.com/docs --language en --create-index
+# NEW: Rich metadata extraction (v1.3.0)
+docpull https://docs.anthropic.com --rich-metadata --create-index
 # NEW: Advanced optimization (v1.2.0)
 docpull https://aptos.dev \
   --deduplicate \
@@ -154,7 +181,7 @@ docpull https://site.com --js
 from docpull import GenericAsyncFetcher
 fetcher = GenericAsyncFetcher(
-    url_or_profile="https://aptos.dev",
+    url="https://aptos.dev",
     output_dir="./docs",
     max_pages=100,
     max_concurrent=20,
@@ -189,6 +216,7 @@ fetcher.fetch()
 - `--naming-strategy {full,short,flat,hierarchical}` – file naming strategy
 - `--create-index` – generate INDEX.md with navigation
 - `--extract-metadata` – extract metadata to metadata.json
+- `--rich-metadata` – extract rich structured metadata (Open Graph, JSON-LD) during fetch
 - `--update-only-changed` – only download changed files
 - `--incremental` – enable incremental mode with resume
 - `--git-commit` – auto-commit changes
@@ -197,6 +225,12 @@ fetcher.fetch()
 - `--archive-format {tar.gz,tar.bz2,tar.xz,zip}` – archive format
 - `--sources-file PATH` – multi-source configuration file
+### NEW in v1.5.0: Network Options
+- `--proxy URL` – proxy URL (HTTP, HTTPS, SOCKS5)
+- `--user-agent STRING` – custom User-Agent string
+- `--max-retries N` – max retry attempts for failed requests (default: 3)
+- `--retry-base-delay SECONDS` – base delay for exponential backoff (default: 1.0)
 See `docpull --help` for complete list of options.
 ## Performance
@@ -215,33 +249,36 @@ Each downloaded page becomes a Markdown file:
 ```markdown
 ---
-url: https://stripe.com/docs/payments
-fetched: 2025-11-13
+url: https://aptos.dev/build/guides/first-transaction
+fetched: 2025-11-28
 ---
-# Payment Intents
+# Your First Transaction
 ...
 ```
-Directory layout mirrors the target site's structure.
-## Configuration File
-### Simple Configuration (v1.0+)
+With `--rich-metadata`, the frontmatter includes Open Graph, JSON-LD, and other structured metadata:
-```yaml
-output_dir: ./docs
-rate_limit: 0.5
-sources:
-  - stripe
-  - nextjs
+```markdown
+---
+url: https://aptos.dev/build/guides/first-transaction
+fetched: 2025-11-28
+title: Your First Transaction
+description: Learn how to submit your first transaction on Aptos
+author: Aptos Foundation
+keywords: [aptos, blockchain, transaction, guide]
+image: https://aptos.dev/img/docs-preview.png
+type: article
+site_name: Aptos Documentation
+---
+# Your First Transaction
+...
 ```
-Run with:
-```bash
-docpull --config config.yaml
-```
+Directory layout mirrors the target site's structure.
+## Configuration File
-### NEW: Multi-Source Configuration (v1.2.0)
+### Multi-Source Configuration
 ```yaml
 sources:
@@ -250,6 +287,7 @@ sources:
     language: en
     max_file_size: 200kb
     create_index: true
+    rich_metadata: true  # Extract Open Graph, JSON-LD metadata
   claude-code:
     url: https://code.claude.com/docs
@@ -279,38 +317,27 @@ docpull --sources-file config.yaml
 See `examples/` directory for more configuration examples.
-## Custom Profiles
-Easily define profiles for frequently scraped sites.
-```python
-from docpull.profiles.base import SiteProfile
-MY_PROFILE = SiteProfile(
-    name="mysite",
-    domains={"docs.mysite.com"},
-    include_patterns=["/docs/", "/api/"],
-)
-```
 ## Security
-- HTTPS-only
-- Blocks private network IPs
+- HTTPS-only (HTTP rejected)
+- **Mandatory robots.txt compliance** (cannot be disabled)
+- Respects Crawl-delay directives
+- Blocks private/internal network IPs
 - 50MB page size limit
-- Timeout controls
-- Validates content-type
-- Playwright sandboxing
+- Timeout controls (30s connection, 5min download)
+- Validates content-type headers
+- Playwright sandboxing for JS rendering
+- Path traversal protection
 ## Troubleshooting
 - **Installation issues**: Run `docpull --doctor` to diagnose problems
-- **Missing dependencies**: See [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common fixes
-- **Site requires JS**: install Playwright + `--js`
-- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
-- **Large sites**: set `--max-pages`
-For detailed troubleshooting, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
+- **Missing dependencies**: `pip install docpull[all]` for all optional dependencies
+- **Site requires JS**: `pip install docpull[js]` then `python -m playwright install chromium`
+- **Slow or rate limited**: Lower `--max-concurrent` or raise `--rate-limit`
+- **Large sites**: Set `--max-pages` to limit crawl size
+- **Proxy issues**: Use `--proxy URL` or set `DOCPULL_PROXY` / `HTTPS_PROXY` env var
+- **Transient failures**: Increase `--max-retries` (default: 3)
 ## v1.2.0 Feature Examples
@@ -366,9 +393,65 @@ See `examples/` directory for comprehensive configuration examples.
 - **After**: 1,250 files, 13 MB (58% reduction), full indexes generated
 - **One command** instead of 4+ separate commands with manual optimization
+## What's New in v1.5.0
+This release focuses on network reliability, proxy support, and TOS compliance.
+**New Features**:
+- **Proxy Support**: HTTP, HTTPS, and SOCKS5 proxies
+  - Use `--proxy URL` or set `DOCPULL_PROXY` / `HTTPS_PROXY` environment variables
+  - Install SOCKS support: `pip install docpull[proxy]`
+- **Retry with Exponential Backoff**: Automatic retries for transient failures
+  - `--max-retries N` (default: 3)
+  - `--retry-base-delay SECONDS` (default: 1.0)
+  - Handles 429, 500, 502, 503, 504 status codes
+- **Custom User-Agent**: `--user-agent STRING` for custom identification
+- **Better Encoding Detection**: Intelligent charset detection using charset-normalizer
+- **Crawl-delay Compliance**: Automatically respects robots.txt Crawl-delay directives
+**Security Enhancement**:
+- **Mandatory robots.txt Compliance**: robots.txt is now always respected (cannot be disabled)
+  - Ensures TOS-friendly scraping behavior
+  - Automatically adjusts rate limiting based on Crawl-delay
+**Codebase Simplification**:
+- Removed built-in profiles (Stripe, etc.) - use URLs directly
+- Consolidated utility modules
+- Moved CONTRIBUTING.md, SECURITY.md to `.github/` directory
+**Backward Compatible**: All existing workflows continue to work unchanged.
+## What's New in v1.3.0
+This release adds rich structured metadata extraction for better AI/RAG integration.
+**New Feature**:
+- **Rich Metadata Extraction**: Extract Open Graph, JSON-LD, microdata, and other structured metadata during fetch
+  - Adds author, description, keywords, images, publish dates, and more to frontmatter
+  - Enhances AI/RAG systems with richer context
+  - Enabled with `--rich-metadata` flag or `rich_metadata: true` in config
+  - Powered by the extruct library
+**Example enhanced frontmatter**:
+```yaml
+---
+url: https://docs.example.com/guide
+fetched: 2025-11-20
+title: Getting Started Guide
+description: Learn the basics of our platform
+author: John Doe
+keywords: [tutorial, guide, api]
+image: https://docs.example.com/og-image.png
+type: article
+published_time: 2024-01-15T10:00:00Z
+---
+```
+**Backward Compatible**: All existing workflows continue to work unchanged. Rich metadata is opt-in.
 ## What's New in v1.2.0
-This release adds 15 major features across 4 phases. See [CHANGELOG.md](CHANGELOG.md) for complete release notes.
+This release adds 15 major features across 4 phases.
 **Highlights**:
 - Multi-source YAML configuration
@@ -387,7 +470,7 @@ This release adds 15 major features across 4 phases. See [CHANGELOG.md](CHANGELO
 - [PyPI](https://pypi.org/project/docpull/)
 - [GitHub](https://github.com/raintree-technology/docpull)
 - [Issues](https://github.com/raintree-technology/docpull/issues)
-- [Changelog](https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md)
+- [Releases](https://github.com/raintree-technology/docpull/releases)
 - [Examples](https://github.com/raintree-technology/docpull/tree/main/examples)
 ## License
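
Note on the retry options added in this file: the README text names `--max-retries` (default 3), `--retry-base-delay` (default 1.0) and the retried status codes 429, 500, 502, 503, 504, but this diff does not show the delay formula itself. The sketch below assumes the common doubling schedule (`base_delay * 2**attempt`) with no jitter; `backoff_delays` and `should_retry` are hypothetical helper names for illustration, not functions from docpull.

```python
# Status codes the README lists as retried in v1.5.0.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delays(max_retries=3, base_delay=1.0):
    """Return the wait (in seconds) before each retry attempt.

    Assumption: the delay doubles on every attempt. The README only says
    "exponential backoff", so the exact factor is a guess.
    """
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


def should_retry(status, attempt, max_retries=3):
    """Retry only listed status codes, and only while attempts remain."""
    return status in RETRYABLE_STATUSES and attempt < max_retries


if __name__ == "__main__":
    print(backoff_delays())        # waits for the documented defaults
    print(should_retry(503, 0))    # transient server error, first attempt
    print(should_retry(404, 0))    # not in the retryable list
```

Under these assumptions the documented defaults give three waits that grow from one second upward, so a persistently failing URL is abandoned after a few seconds rather than retried indefinitely.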

{docpull-1.2.1 → docpull-1.5.0}/README.md RENAMED

@@ -3,7 +3,11 @@
 **Pull documentation from any website and converts it into clean, AI-ready Markdown.**
 Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
-**NEW in v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
+**NEW in v1.5.0**: Proxy support, retry with exponential backoff, custom User-Agent, and mandatory robots.txt compliance for TOS-friendly scraping.
+**v1.3.0**: Rich structured metadata extraction (Open Graph, JSON-LD) for enhanced AI/RAG integration.
+**v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
 [![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
@@ -26,9 +30,22 @@ Unlike tools like wget or httrack, docpull extracts only the main content, remov
 - Sitemap + link crawling
 - Rate limiting, timeouts, content-type checks
 - Saves docs in structured Markdown with YAML metadata
-- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
-### NEW in v1.2.0: Advanced Optimization
+- **Mandatory robots.txt compliance** for TOS-friendly scraping
+### NEW in v1.5.0: Network & Reliability
+- **Proxy Support**: HTTP, HTTPS, and SOCKS5 proxies via `--proxy` or env vars
+- **Retry with Exponential Backoff**: Configurable retries for transient failures
+- **Custom User-Agent**: Set custom User-Agent strings for requests
+- **Crawl-delay Compliance**: Automatically respects robots.txt Crawl-delay directives
+- **Better Encoding Detection**: Intelligent charset detection for international docs
+### v1.3.0: Rich Metadata Extraction
+- **Structured Metadata**: Extract Open Graph, JSON-LD, and microdata during fetch
+- **Enhanced Frontmatter**: Adds author, description, keywords, images, publish dates, and more
+- **AI/RAG Ready**: Richer context for embeddings and retrieval systems
+- **Opt-in Feature**: Enabled with `--rich-metadata` flag
+### v1.2.0: Advanced Optimization
 - **Language Filtering**: Auto-detect and filter by language (skip 352+ translation files)
 - **Deduplication**: Remove duplicates with SHA-256 hashing (save 10+ MB on duplicate content)
 - **Auto-Index Generation**: Create navigable INDEX.md with tree/TOC/categories/stats
@@ -55,11 +72,14 @@ docpull --doctor         # verify installation
 # Basic usage
 docpull https://aptos.dev
-docpull stripe           # use a built-in profile
+docpull https://docs.anthropic.com
 # NEW: Simple optimization (v1.2.0)
 docpull https://code.claude.com/docs --language en --create-index
+# NEW: Rich metadata extraction (v1.3.0)
+docpull https://docs.anthropic.com --rich-metadata --create-index
 # NEW: Advanced optimization (v1.2.0)
 docpull https://aptos.dev \
   --deduplicate \
@@ -85,7 +105,7 @@ docpull https://site.com --js
 from docpull import GenericAsyncFetcher
 fetcher = GenericAsyncFetcher(
-    url_or_profile="https://aptos.dev",
+    url="https://aptos.dev",
     output_dir="./docs",
     max_pages=100,
     max_concurrent=20,
@@ -120,6 +140,7 @@ fetcher.fetch()
 - `--naming-strategy {full,short,flat,hierarchical}` – file naming strategy
 - `--create-index` – generate INDEX.md with navigation
 - `--extract-metadata` – extract metadata to metadata.json
+- `--rich-metadata` – extract rich structured metadata (Open Graph, JSON-LD) during fetch
 - `--update-only-changed` – only download changed files
 - `--incremental` – enable incremental mode with resume
 - `--git-commit` – auto-commit changes
@@ -128,6 +149,12 @@ fetcher.fetch()
 - `--archive-format {tar.gz,tar.bz2,tar.xz,zip}` – archive format
 - `--sources-file PATH` – multi-source configuration file
+### NEW in v1.5.0: Network Options
+- `--proxy URL` – proxy URL (HTTP, HTTPS, SOCKS5)
+- `--user-agent STRING` – custom User-Agent string
+- `--max-retries N` – max retry attempts for failed requests (default: 3)
+- `--retry-base-delay SECONDS` – base delay for exponential backoff (default: 1.0)
 See `docpull --help` for complete list of options.
 ## Performance
@@ -146,33 +173,36 @@ Each downloaded page becomes a Markdown file:
 ```markdown
 ---
-url: https://stripe.com/docs/payments
-fetched: 2025-11-13
+url: https://aptos.dev/build/guides/first-transaction
+fetched: 2025-11-28
 ---
-# Payment Intents
+# Your First Transaction
 ...
 ```
-Directory layout mirrors the target site's structure.
-## Configuration File
-### Simple Configuration (v1.0+)
+With `--rich-metadata`, the frontmatter includes Open Graph, JSON-LD, and other structured metadata:
-```yaml
-output_dir: ./docs
-rate_limit: 0.5
-sources:
-  - stripe
-  - nextjs
+```markdown
+---
+url: https://aptos.dev/build/guides/first-transaction
+fetched: 2025-11-28
+title: Your First Transaction
+description: Learn how to submit your first transaction on Aptos
+author: Aptos Foundation
+keywords: [aptos, blockchain, transaction, guide]
+image: https://aptos.dev/img/docs-preview.png
+type: article
+site_name: Aptos Documentation
+---
+# Your First Transaction
+...
 ```
-Run with:
-```bash
-docpull --config config.yaml
-```
+Directory layout mirrors the target site's structure.
+## Configuration File
-### NEW: Multi-Source Configuration (v1.2.0)
+### Multi-Source Configuration
 ```yaml
 sources:
@@ -181,6 +211,7 @@ sources:
     language: en
     max_file_size: 200kb
     create_index: true
+    rich_metadata: true  # Extract Open Graph, JSON-LD metadata
   claude-code:
     url: https://code.claude.com/docs
@@ -210,38 +241,27 @@ docpull --sources-file config.yaml
 See `examples/` directory for more configuration examples.
-## Custom Profiles
-Easily define profiles for frequently scraped sites.
-```python
-from docpull.profiles.base import SiteProfile
-MY_PROFILE = SiteProfile(
-    name="mysite",
-    domains={"docs.mysite.com"},
-    include_patterns=["/docs/", "/api/"],
-)
-```
 ## Security
-- HTTPS-only
-- Blocks private network IPs
+- HTTPS-only (HTTP rejected)
+- **Mandatory robots.txt compliance** (cannot be disabled)
+- Respects Crawl-delay directives
+- Blocks private/internal network IPs
 - 50MB page size limit
-- Timeout controls
-- Validates content-type
-- Playwright sandboxing
+- Timeout controls (30s connection, 5min download)
+- Validates content-type headers
+- Playwright sandboxing for JS rendering
+- Path traversal protection
 ## Troubleshooting
 - **Installation issues**: Run `docpull --doctor` to diagnose problems
-- **Missing dependencies**: See [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common fixes
-- **Site requires JS**: install Playwright + `--js`
-- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
-- **Large sites**: set `--max-pages`
-For detailed troubleshooting, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
+- **Missing dependencies**: `pip install docpull[all]` for all optional dependencies
+- **Site requires JS**: `pip install docpull[js]` then `python -m playwright install chromium`
+- **Slow or rate limited**: Lower `--max-concurrent` or raise `--rate-limit`
+- **Large sites**: Set `--max-pages` to limit crawl size
+- **Proxy issues**: Use `--proxy URL` or set `DOCPULL_PROXY` / `HTTPS_PROXY` env var
+- **Transient failures**: Increase `--max-retries` (default: 3)
 ## v1.2.0 Feature Examples
@@ -297,9 +317,65 @@ See `examples/` directory for comprehensive configuration examples.
 - **After**: 1,250 files, 13 MB (58% reduction), full indexes generated
 - **One command** instead of 4+ separate commands with manual optimization
+## What's New in v1.5.0
+This release focuses on network reliability, proxy support, and TOS compliance.
+**New Features**:
+- **Proxy Support**: HTTP, HTTPS, and SOCKS5 proxies
+  - Use `--proxy URL` or set `DOCPULL_PROXY` / `HTTPS_PROXY` environment variables
+  - Install SOCKS support: `pip install docpull[proxy]`
+- **Retry with Exponential Backoff**: Automatic retries for transient failures
+  - `--max-retries N` (default: 3)
+  - `--retry-base-delay SECONDS` (default: 1.0)
+  - Handles 429, 500, 502, 503, 504 status codes
+- **Custom User-Agent**: `--user-agent STRING` for custom identification
+- **Better Encoding Detection**: Intelligent charset detection using charset-normalizer
+- **Crawl-delay Compliance**: Automatically respects robots.txt Crawl-delay directives
+**Security Enhancement**:
+- **Mandatory robots.txt Compliance**: robots.txt is now always respected (cannot be disabled)
+  - Ensures TOS-friendly scraping behavior
+  - Automatically adjusts rate limiting based on Crawl-delay
+**Codebase Simplification**:
+- Removed built-in profiles (Stripe, etc.) - use URLs directly
+- Consolidated utility modules
+- Moved CONTRIBUTING.md, SECURITY.md to `.github/` directory
+**Backward Compatible**: All existing workflows continue to work unchanged.
+## What's New in v1.3.0
+This release adds rich structured metadata extraction for better AI/RAG integration.
+**New Feature**:
+- **Rich Metadata Extraction**: Extract Open Graph, JSON-LD, microdata, and other structured metadata during fetch
+  - Adds author, description, keywords, images, publish dates, and more to frontmatter
+  - Enhances AI/RAG systems with richer context
+  - Enabled with `--rich-metadata` flag or `rich_metadata: true` in config
+  - Powered by the extruct library
+**Example enhanced frontmatter**:
+```yaml
+---
+url: https://docs.example.com/guide
+fetched: 2025-11-20
+title: Getting Started Guide
+description: Learn the basics of our platform
+author: John Doe
+keywords: [tutorial, guide, api]
+image: https://docs.example.com/og-image.png
+type: article
+published_time: 2024-01-15T10:00:00Z
+---
+```
+**Backward Compatible**: All existing workflows continue to work unchanged. Rich metadata is opt-in.
 ## What's New in v1.2.0
-This release adds 15 major features across 4 phases. See [CHANGELOG.md](CHANGELOG.md) for complete release notes.
+This release adds 15 major features across 4 phases.
 **Highlights**:
 - Multi-source YAML configuration
@@ -318,7 +394,7 @@ This release adds 15 major features across 4 phases. See [CHANGELOG.md](CHANGELO
 - [PyPI](https://pypi.org/project/docpull/)
 - [GitHub](https://github.com/raintree-technology/docpull)
 - [Issues](https://github.com/raintree-technology/docpull/issues)
-- [Changelog](https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md)
+- [Releases](https://github.com/raintree-technology/docpull/releases)
 - [Examples](https://github.com/raintree-technology/docpull/tree/main/examples)
 ## License
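
Note on the robots.txt changes in this file: the README states that robots.txt is always respected and that rate limiting is adjusted from `Crawl-delay`, but the fetcher code is not part of this excerpt. The sketch below shows one way such a check could look using only the standard library's `urllib.robotparser`. The `crawl_policy` helper, the `docpull-example` user agent, the `example.com` URLs and the 0.5 second default delay are all assumptions made for illustration, and taking the larger of the two delays is a guess at how the combination might work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body used only for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, url, default_delay=0.5):
    """Return (allowed, delay_seconds) for one URL.

    Assumption: the effective delay is the larger of the configured
    rate limit and the site's Crawl-delay, when one is declared.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text already downloaded
    allowed = parser.can_fetch(user_agent, url)
    declared = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    delay = default_delay if declared is None else max(default_delay, declared)
    return allowed, delay


if __name__ == "__main__":
    print(crawl_policy(ROBOTS_TXT, "docpull-example", "https://example.com/docs/"))
    print(crawl_policy(ROBOTS_TXT, "docpull-example", "https://example.com/private/x"))
```

Parsing the text directly keeps the example offline; a real crawler would first download `/robots.txt` from the target host and cache the parsed result per domain.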

docpull-1.5.0/docpull/__init__.py ADDED

@@ -0,0 +1,13 @@
+__version__ = "1.5.0"
+from .fetchers.base import BaseFetcher
+from .fetchers.generic import GenericFetcher
+from .fetchers.generic_async import GenericAsyncFetcher
+from .fetchers.parallel_base import ParallelFetcher
+__all__ = [
+    "BaseFetcher",
+    "GenericFetcher",
+    "GenericAsyncFetcher",
+    "ParallelFetcher",
+]
