PyPI - docpull - Versions diffs - 1.2.1__tar.gz → 1.3.0__tar.gz - Mend

docpull 1.2.1__tar.gz → 1.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (83)

{docpull-1.2.1 → docpull-1.3.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,81 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [1.3.0] - 2025-11-20
+### Added
+**Rich Metadata Extraction**
+- Extract structured metadata (Open Graph, JSON-LD, microdata) during fetch
+- New `--rich-metadata` CLI flag to enable rich metadata extraction
+- Enhanced frontmatter with author, description, keywords, images, publish dates, tags, and more
+- Better context for AI/RAG systems with richer document metadata
+- Powered by `extruct` library
+- Opt-in feature, backward compatible with existing workflows
+### Changed
+**Simplified Profile System**
+- Removed 7 built-in profiles (Next.js, React, Plaid, Tailwind, Bun, D3, Turborepo)
+- Kept Stripe profile as reference implementation
+- Generic fetcher works excellently for all documentation sites
+- Users can create custom profiles or use URLs directly
+- Reduced maintenance burden and codebase complexity
+### Technical Details
+**New Dependencies:**
+- Added `extruct>=0.15.0` for structured metadata extraction
+**New Files:**
+- `docpull/metadata_extractor.py` - Rich metadata extraction module
+- `tests/test_metadata_extractor.py` - Comprehensive test suite for metadata extraction
+**Updated Files:**
+- `docpull/fetchers/base.py` - Integrated rich metadata extraction into fetch pipeline
+- `docpull/fetchers/generic_async.py` - Added `use_rich_metadata` parameter
+- `docpull/config.py` - Added `rich_metadata` configuration option
+- `docpull/sources_config.py` - Added `rich_metadata` field to SourceConfig
+- `docpull/cli.py` - Added `--rich-metadata` CLI flag
+- `docpull/profiles/__init__.py` - Simplified to single Stripe profile
+**Removed Files:**
+- Removed 7 profile files and 7 fetcher implementation files
+**Version Bump:**
+- Updated version from `1.2.1` to `1.3.0`
+### Example Usage
+```bash
+# Extract rich metadata during fetch
+docpull https://docs.anthropic.com --rich-metadata
+# Combine with other features
+docpull https://stripe.com/docs --rich-metadata --create-index --language en
+# Multi-source configuration
+docpull --sources-file config.yaml  # with rich_metadata: true per source
+```
+### Example Enhanced Frontmatter
+```yaml
+---
+url: https://docs.example.com/guide
+fetched: 2025-11-20
+title: Getting Started Guide
+description: Learn the basics of our platform
+author: John Doe
+keywords: [tutorial, guide, api]
+image: https://docs.example.com/og-image.png
+type: article
+site_name: Example Docs
+published_time: 2024-01-15T10:00:00Z
+modified_time: 2024-01-20T15:30:00Z
+---
+```
 ## [1.2.0] - 2025-11-16
 ### Added - 15 Major New Features
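
The changelog above says that Open Graph and JSON-LD data are pulled out of each page during fetch. As a rough illustration of what that kind of extraction involves, here is a minimal sketch using only the Python standard library. It is not the package's implementation (which the changelog says is powered by `extruct`), and the names `StructuredMetaParser` and `extract_structured` are hypothetical.

```python
import json
from html.parser import HTMLParser


class StructuredMetaParser(HTMLParser):
    """Collect Open Graph <meta> tags and JSON-LD <script> blocks (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.open_graph = {}   # e.g. {"title": "...", "site_name": "..."}
        self.json_ld = []      # one parsed object per JSON-LD script block
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            # Strip the "og:" prefix: og:title -> title, og:site_name -> site_name
            self.open_graph[prop[3:]] = attr.get("content") or ""
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so accumulate them.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing the whole page


def extract_structured(html: str) -> dict:
    """Return the Open Graph properties and JSON-LD objects found in an HTML string."""
    parser = StructuredMetaParser()
    parser.feed(html)
    parser.close()
    return {"opengraph": parser.open_graph, "json_ld": parser.json_ld}
```

A dedicated library additionally covers microdata and many edge cases that this sketch ignores, which is presumably why the release adds a dependency rather than hand-rolling a parser.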

{docpull-1.2.1 → docpull-1.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docpull
-Version: 1.2.1
+Version: 1.3.0
 Summary: Pull documentation from the web and convert to clean markdown
 Author-email: Zachary Roth <support@raintree.technology>
 Maintainer-email: Raintree Technology <support@raintree.technology>
@@ -43,6 +43,7 @@ Requires-Dist: requests>=2.31.0
 Requires-Dist: beautifulsoup4>=4.12.0
 Requires-Dist: html2text>=2020.1.16
 Requires-Dist: defusedxml>=0.7.1
+Requires-Dist: extruct>=0.15.0
 Requires-Dist: aiohttp>=3.9.0
 Requires-Dist: rich>=13.0.0
 Requires-Dist: pyyaml>=6.0
@@ -72,7 +73,9 @@ Dynamic: license-file
 **Pull documentation from any website and converts it into clean, AI-ready Markdown.**
 Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
-**NEW in v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
+**NEW in v1.3.0**: Rich structured metadata extraction (Open Graph, JSON-LD) for enhanced AI/RAG integration.
+**v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
 [![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
@@ -95,9 +98,15 @@ Unlike tools like wget or httrack, docpull extracts only the main content, remov
 - Sitemap + link crawling
 - Rate limiting, timeouts, content-type checks
 - Saves docs in structured Markdown with YAML metadata
-- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
+- Built-in Stripe profile as reference implementation (custom profiles easily added)
+### NEW in v1.3.0: Rich Metadata Extraction
+- **Structured Metadata**: Extract Open Graph, JSON-LD, and microdata during fetch
+- **Enhanced Frontmatter**: Adds author, description, keywords, images, publish dates, and more
+- **AI/RAG Ready**: Richer context for embeddings and retrieval systems
+- **Opt-in Feature**: Enabled with `--rich-metadata` flag
-### NEW in v1.2.0: Advanced Optimization
+### v1.2.0: Advanced Optimization
 - **Language Filtering**: Auto-detect and filter by language (skip 352+ translation files)
 - **Deduplication**: Remove duplicates with SHA-256 hashing (save 10+ MB on duplicate content)
 - **Auto-Index Generation**: Create navigable INDEX.md with tree/TOC/categories/stats
@@ -129,6 +138,9 @@ docpull stripe           # use a built-in profile
 # NEW: Simple optimization (v1.2.0)
 docpull https://code.claude.com/docs --language en --create-index
+# NEW: Rich metadata extraction (v1.3.0)
+docpull https://docs.anthropic.com --rich-metadata --create-index
 # NEW: Advanced optimization (v1.2.0)
 docpull https://aptos.dev \
   --deduplicate \
@@ -189,6 +201,7 @@ fetcher.fetch()
 - `--naming-strategy {full,short,flat,hierarchical}` – file naming strategy
 - `--create-index` – generate INDEX.md with navigation
 - `--extract-metadata` – extract metadata to metadata.json
+- `--rich-metadata` – extract rich structured metadata (Open Graph, JSON-LD) during fetch
 - `--update-only-changed` – only download changed files
 - `--incremental` – enable incremental mode with resume
 - `--git-commit` – auto-commit changes
@@ -222,6 +235,24 @@ fetched: 2025-11-13
 ...
 ```
+With `--rich-metadata`, the frontmatter includes Open Graph, JSON-LD, and other structured metadata:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+title: Accept a payment
+description: Learn how to accept payments with the Payment Intents API
+author: Stripe
+keywords: [payments, api, stripe, checkout]
+image: https://stripe.com/img/docs-preview.png
+type: article
+site_name: Stripe Documentation
+---
+# Payment Intents
+...
+```
 Directory layout mirrors the target site's structure.
 ## Configuration File
@@ -232,8 +263,8 @@ Directory layout mirrors the target site's structure.
 output_dir: ./docs
 rate_limit: 0.5
 sources:
-  - stripe
-  - nextjs
+  - stripe  # Built-in profile
+  - https://docs.example.com  # Or any URL
 ```
 Run with:
@@ -250,6 +281,7 @@ sources:
     language: en
     max_file_size: 200kb
     create_index: true
+    rich_metadata: true  # Extract Open Graph, JSON-LD metadata
   claude-code:
     url: https://code.claude.com/docs
@@ -281,7 +313,7 @@ See `examples/` directory for more configuration examples.
 ## Custom Profiles
-Easily define profiles for frequently scraped sites.
+docpull includes a Stripe profile as reference. Create custom profiles for other sites:
 ```python
 from docpull.profiles.base import SiteProfile
@@ -290,9 +322,13 @@ MY_PROFILE = SiteProfile(
     name="mysite",
     domains={"docs.mysite.com"},
     include_patterns=["/docs/", "/api/"],
+    sitemap_url="https://docs.mysite.com/sitemap.xml",
+    rate_limit=0.5,
 )
 ```
+**Want to contribute profiles?** Submit a PR with your custom profile! Popular ones may be added to the core or a community profiles repository.
 ## Security
 - HTTPS-only
@@ -366,6 +402,34 @@ See `examples/` directory for comprehensive configuration examples.
 - **After**: 1,250 files, 13 MB (58% reduction), full indexes generated
 - **One command** instead of 4+ separate commands with manual optimization
+## What's New in v1.3.0
+This release adds rich structured metadata extraction for better AI/RAG integration.
+**New Feature**:
+- **Rich Metadata Extraction**: Extract Open Graph, JSON-LD, microdata, and other structured metadata during fetch
+  - Adds author, description, keywords, images, publish dates, and more to frontmatter
+  - Enhances AI/RAG systems with richer context
+  - Enabled with `--rich-metadata` flag or `rich_metadata: true` in config
+  - Powered by the extruct library
+**Example enhanced frontmatter**:
+```yaml
+---
+url: https://docs.example.com/guide
+fetched: 2025-11-20
+title: Getting Started Guide
+description: Learn the basics of our platform
+author: John Doe
+keywords: [tutorial, guide, api]
+image: https://docs.example.com/og-image.png
+type: article
+published_time: 2024-01-15T10:00:00Z
+---
+```
+**Backward Compatible**: All existing workflows continue to work unchanged. Rich metadata is opt-in.
 ## What's New in v1.2.0
 This release adds 15 major features across 4 phases. See [CHANGELOG.md](CHANGELOG.md) for complete release notes.

{docpull-1.2.1 → docpull-1.3.0}/README.md RENAMED

@@ -3,7 +3,9 @@
 **Pull documentation from any website and converts it into clean, AI-ready Markdown.**
 Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
-**NEW in v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
+**NEW in v1.3.0**: Rich structured metadata extraction (Open Graph, JSON-LD) for enhanced AI/RAG integration.
+**v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
 [![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
@@ -26,9 +28,15 @@ Unlike tools like wget or httrack, docpull extracts only the main content, remov
 - Sitemap + link crawling
 - Rate limiting, timeouts, content-type checks
 - Saves docs in structured Markdown with YAML metadata
-- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
+- Built-in Stripe profile as reference implementation (custom profiles easily added)
+### NEW in v1.3.0: Rich Metadata Extraction
+- **Structured Metadata**: Extract Open Graph, JSON-LD, and microdata during fetch
+- **Enhanced Frontmatter**: Adds author, description, keywords, images, publish dates, and more
+- **AI/RAG Ready**: Richer context for embeddings and retrieval systems
+- **Opt-in Feature**: Enabled with `--rich-metadata` flag
-### NEW in v1.2.0: Advanced Optimization
+### v1.2.0: Advanced Optimization
 - **Language Filtering**: Auto-detect and filter by language (skip 352+ translation files)
 - **Deduplication**: Remove duplicates with SHA-256 hashing (save 10+ MB on duplicate content)
 - **Auto-Index Generation**: Create navigable INDEX.md with tree/TOC/categories/stats
@@ -60,6 +68,9 @@ docpull stripe           # use a built-in profile
 # NEW: Simple optimization (v1.2.0)
 docpull https://code.claude.com/docs --language en --create-index
+# NEW: Rich metadata extraction (v1.3.0)
+docpull https://docs.anthropic.com --rich-metadata --create-index
 # NEW: Advanced optimization (v1.2.0)
 docpull https://aptos.dev \
   --deduplicate \
@@ -120,6 +131,7 @@ fetcher.fetch()
 - `--naming-strategy {full,short,flat,hierarchical}` – file naming strategy
 - `--create-index` – generate INDEX.md with navigation
 - `--extract-metadata` – extract metadata to metadata.json
+- `--rich-metadata` – extract rich structured metadata (Open Graph, JSON-LD) during fetch
 - `--update-only-changed` – only download changed files
 - `--incremental` – enable incremental mode with resume
 - `--git-commit` – auto-commit changes
@@ -153,6 +165,24 @@ fetched: 2025-11-13
 ...
 ```
+With `--rich-metadata`, the frontmatter includes Open Graph, JSON-LD, and other structured metadata:
+```markdown
+---
+url: https://stripe.com/docs/payments
+fetched: 2025-11-13
+title: Accept a payment
+description: Learn how to accept payments with the Payment Intents API
+author: Stripe
+keywords: [payments, api, stripe, checkout]
+image: https://stripe.com/img/docs-preview.png
+type: article
+site_name: Stripe Documentation
+---
+# Payment Intents
+...
+```
 Directory layout mirrors the target site's structure.
 ## Configuration File
@@ -163,8 +193,8 @@ Directory layout mirrors the target site's structure.
 output_dir: ./docs
 rate_limit: 0.5
 sources:
-  - stripe
-  - nextjs
+  - stripe  # Built-in profile
+  - https://docs.example.com  # Or any URL
 ```
 Run with:
@@ -181,6 +211,7 @@ sources:
     language: en
     max_file_size: 200kb
     create_index: true
+    rich_metadata: true  # Extract Open Graph, JSON-LD metadata
   claude-code:
     url: https://code.claude.com/docs
@@ -212,7 +243,7 @@ See `examples/` directory for more configuration examples.
 ## Custom Profiles
-Easily define profiles for frequently scraped sites.
+docpull includes a Stripe profile as reference. Create custom profiles for other sites:
 ```python
 from docpull.profiles.base import SiteProfile
@@ -221,9 +252,13 @@ MY_PROFILE = SiteProfile(
     name="mysite",
     domains={"docs.mysite.com"},
     include_patterns=["/docs/", "/api/"],
+    sitemap_url="https://docs.mysite.com/sitemap.xml",
+    rate_limit=0.5,
 )
 ```
+**Want to contribute profiles?** Submit a PR with your custom profile! Popular ones may be added to the core or a community profiles repository.
 ## Security
 - HTTPS-only
@@ -297,6 +332,34 @@ See `examples/` directory for comprehensive configuration examples.
 - **After**: 1,250 files, 13 MB (58% reduction), full indexes generated
 - **One command** instead of 4+ separate commands with manual optimization
+## What's New in v1.3.0
+This release adds rich structured metadata extraction for better AI/RAG integration.
+**New Feature**:
+- **Rich Metadata Extraction**: Extract Open Graph, JSON-LD, microdata, and other structured metadata during fetch
+  - Adds author, description, keywords, images, publish dates, and more to frontmatter
+  - Enhances AI/RAG systems with richer context
+  - Enabled with `--rich-metadata` flag or `rich_metadata: true` in config
+  - Powered by the extruct library
+**Example enhanced frontmatter**:
+```yaml
+---
+url: https://docs.example.com/guide
+fetched: 2025-11-20
+title: Getting Started Guide
+description: Learn the basics of our platform
+author: John Doe
+keywords: [tutorial, guide, api]
+image: https://docs.example.com/og-image.png
+type: article
+published_time: 2024-01-15T10:00:00Z
+---
+```
+**Backward Compatible**: All existing workflows continue to work unchanged. Rich metadata is opt-in.
 ## What's New in v1.2.0
 This release adds 15 major features across 4 phases. See [CHANGELOG.md](CHANGELOG.md) for complete release notes.

docpull-1.3.0/docpull/__init__.py ADDED

@@ -0,0 +1,15 @@
+__version__ = "1.3.0"
+from .fetchers.base import BaseFetcher
+from .fetchers.generic import GenericFetcher
+from .fetchers.generic_async import GenericAsyncFetcher
+from .fetchers.parallel_base import ParallelFetcher
+from .fetchers.stripe import StripeFetcher
+__all__ = [
+    "BaseFetcher",
+    "GenericFetcher",
+    "GenericAsyncFetcher",
+    "ParallelFetcher",
+    "StripeFetcher",
+]

{docpull-1.2.1 → docpull-1.3.0}/docpull/cli.py RENAMED

@@ -295,13 +295,18 @@ Examples:
     )
     # Index Generation
-    index_group = parser.add_argument_group("index generation")
+    index_group = parser.add_argument_group("index generation & metadata")
     index_group.add_argument(
         "--create-index", action="store_true", help="Create INDEX.md with file tree and navigation"
     )
     index_group.add_argument(
         "--extract-metadata", action="store_true", help="Extract metadata to metadata.json"
     )
+    index_group.add_argument(
+        "--rich-metadata",
+        action="store_true",
+        help="Extract rich structured metadata (Open Graph, JSON-LD) during fetch",
+    )
     # Update Detection
     cache_group = parser.add_argument_group("update detection & caching")
@@ -635,6 +640,7 @@ def run_generic_fetchers(args: argparse.Namespace) -> int:
                 max_concurrent=max_concurrent,
                 use_js=use_js,
                 show_progress=show_progress,
+                use_rich_metadata=args.rich_metadata,
             )
             fetcher.fetch()  # This calls asyncio.run() internally
@@ -741,6 +747,7 @@ def run_multi_source_fetch(args: argparse.Namespace) -> int:
                 max_concurrent=source_config.max_concurrent or 10,
                 use_js=source_config.javascript,
                 show_progress=True,
+                use_rich_metadata=source_config.rich_metadata or False,
             )
             # Fetch
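
The hunks above add a `store_true` flag and forward it to the fetcher as `use_rich_metadata`. The sketch below reproduces just that wiring in isolation. `build_parser`, `resolve_flag`, and the `docpull-sketch` program name are hypothetical; only the flag names and the `use_rich_metadata` keyword come from the diff.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with the two flags from the 'index generation & metadata' group."""
    parser = argparse.ArgumentParser(prog="docpull-sketch")
    parser.add_argument("url")
    group = parser.add_argument_group("index generation & metadata")
    group.add_argument("--create-index", action="store_true")
    # argparse maps "--rich-metadata" to the attribute name "rich_metadata"
    group.add_argument("--rich-metadata", action="store_true")
    return parser


def resolve_flag(value: Optional[bool]) -> bool:
    """Mirror the `source_config.rich_metadata or False` idiom: an unset (None) option becomes False."""
    return value or False


args = build_parser().parse_args(["https://docs.example.com", "--rich-metadata"])
fetcher_kwargs = {"use_rich_metadata": args.rich_metadata}
```

Because the flag uses `store_true`, omitting it yields `False`, which matches the opt-in behavior described in the changelog.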

{docpull-1.2.1 → docpull-1.3.0}/docpull/config.py RENAMED

@@ -34,6 +34,7 @@ class FetcherConfig:
         naming_strategy: str = "full",
         create_index: bool = False,
         extract_metadata: bool = False,
+        rich_metadata: bool = False,
         update_only_changed: bool = False,
         incremental: bool = False,
         cache_dir: str = ".docpull-cache",
@@ -52,7 +53,7 @@ class FetcherConfig:
             skip_existing: Skip existing files
             log_level: Logging level
             log_file: Optional log file path
-            sources: List of sources to fetch (e.g., ['stripe', 'plaid'])
+            sources: List of sources to fetch (profile names or URLs, e.g., ['stripe', 'https://docs.example.com'])
             dry_run: Dry run mode (don't download files)
             language: Include only this language (e.g., 'en')
             exclude_languages: Exclude these languages
@@ -67,6 +68,7 @@ class FetcherConfig:
             naming_strategy: File naming strategy (full, short, flat, hierarchical)
             create_index: Create INDEX.md with navigation
             extract_metadata: Extract metadata to metadata.json
+            rich_metadata: Extract rich structured metadata (Open Graph, JSON-LD) during fetch
             update_only_changed: Only download changed files
             incremental: Enable incremental mode
             cache_dir: Cache directory for update detection
@@ -81,7 +83,7 @@ class FetcherConfig:
         self.skip_existing = skip_existing
         self.log_level = log_level
         self.log_file = log_file
-        self.sources = sources or ["plaid", "stripe"]
+        self.sources = sources or ["stripe"]
         self.dry_run = dry_run
         # v1.2.0 features
@@ -98,6 +100,7 @@ class FetcherConfig:
         self.naming_strategy = naming_strategy
         self.create_index = create_index
         self.extract_metadata = extract_metadata
+        self.rich_metadata = rich_metadata
         self.update_only_changed = update_only_changed
         self.incremental = incremental
         self.cache_dir = Path(cache_dir)
@@ -131,11 +134,13 @@ class FetcherConfig:
         if not isinstance(rate_limit, (int, float)) or rate_limit < 0 or rate_limit > 60:
             raise ValueError("rate_limit must be between 0 and 60")
-        # Validate sources
-        valid_sources = {"bun", "d3", "nextjs", "plaid", "react", "stripe", "tailwind", "turborepo"}
-        sources = config_dict.get("sources", ["plaid", "stripe"])
-        if not all(s in valid_sources for s in sources):
-            raise ValueError(f"Invalid sources. Must be from: {valid_sources}")
+        # Validate sources (built-in profiles or URLs)
+        valid_sources = {"stripe"}
+        sources = config_dict.get("sources", ["stripe"])
+        # Allow URLs or valid profile names
+        for source in sources:
+            if not (source in valid_sources or source.startswith("http://") or source.startswith("https://")):
+                raise ValueError(f"Invalid source: {source}. Must be 'stripe' or a URL")
         # Validate log_level
         log_level = config_dict.get("log_level", "INFO")
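
The new validation accepts either the built-in `stripe` profile name or anything that starts with `http://` or `https://`. A standalone sketch of the same rule follows. The function name `validate_sources` and the constant `VALID_PROFILES` are hypothetical, and the check is deliberately as shallow as the one in the diff: it tests the prefix only and does not parse the URL.

```python
VALID_PROFILES = {"stripe"}  # the only built-in profile left in 1.3.0, per the diff


def validate_sources(sources: list) -> list:
    """Raise ValueError for any source that is neither a known profile nor an http(s) URL."""
    for source in sources:
        is_profile = source in VALID_PROFILES
        # str.startswith accepts a tuple, so both schemes are checked in one call
        is_url = source.startswith(("http://", "https://"))
        if not (is_profile or is_url):
            raise ValueError(f"Invalid source: {source}. Must be 'stripe' or a URL")
    return sources


validate_sources(["stripe", "https://docs.example.com"])  # passes silently
```

Under this rule, a profile name removed in this release, such as `nextjs`, is rejected, so configuration files that still list it would need to switch to a URL.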

docpull-1.3.0/docpull/fetchers/__init__.py ADDED

@@ -0,0 +1,9 @@
+from .base import BaseFetcher
+from .parallel_base import ParallelFetcher
+from .stripe import StripeFetcher
+__all__ = [
+    "BaseFetcher",
+    "ParallelFetcher",
+    "StripeFetcher",
+]

{docpull-1.2.1 → docpull-1.3.0}/docpull/fetchers/base.py RENAMED

@@ -59,10 +59,12 @@ class BaseFetcher(ABC):
         skip_existing: bool = True,
         logger: Optional[logging.Logger] = None,
         allowed_domains: Optional[set[str]] = None,
+        use_rich_metadata: bool = False,
     ) -> None:
         self.output_dir = Path(output_dir).resolve()
         self.rate_limit = rate_limit
         self.skip_existing = skip_existing
+        self.use_rich_metadata = use_rich_metadata
         self.logger = logger or logging.getLogger(f"{__name__}.{self.__class__.__name__}")
         self.allowed_domains = allowed_domains
         self.h2t = html2text.HTML2Text()
@@ -98,6 +100,14 @@ class BaseFetcher(ABC):
         if user_agent is None:
             user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
         self.session.headers.update({"User-Agent": user_agent})
+        # Initialize rich metadata extractor if enabled
+        self.rich_metadata_extractor = None
+        if self.use_rich_metadata:
+            from ..metadata_extractor import RichMetadataExtractor
+            self.rich_metadata_extractor = RichMetadataExtractor()
         self.stats: FetcherStats = {
             "fetched": 0,
             "skipped": 0,
@@ -358,6 +368,14 @@ class BaseFetcher(ABC):
             soup = BeautifulSoup(content, "html.parser")
+            # Extract rich metadata if enabled
+            rich_meta = None
+            if self.use_rich_metadata and self.rich_metadata_extractor:
+                try:
+                    rich_meta = self.rich_metadata_extractor.extract(content.decode("utf-8"), url)
+                except Exception as e:
+                    self.logger.debug(f"Rich metadata extraction failed for {url}: {e}")
             for element in soup(["script", "style", "nav", "footer", "header"]):
                 element.decompose()
             main_content = (
@@ -369,12 +387,47 @@ class BaseFetcher(ABC):
             if main_content:
                 markdown = self.h2t.handle(str(main_content))
-                frontmatter = f"""---
-url: {url}
-fetched: {time.strftime('%Y-%m-%d')}
----
-"""
+                # Build frontmatter with optional rich metadata
+                frontmatter_parts = [
+                    "---",
+                    f"url: {url}",
+                    f"fetched: {time.strftime('%Y-%m-%d')}",
+                ]
+                if rich_meta:
+                    # Add rich metadata fields if available
+                    if rich_meta.get("title"):
+                        frontmatter_parts.append(f"title: {rich_meta['title']}")
+                    if rich_meta.get("description"):
+                        # Escape any colons in description
+                        desc = str(rich_meta["description"]).replace(":", "\\:")
+                        frontmatter_parts.append(f"description: {desc}")
+                    if rich_meta.get("author"):
+                        frontmatter_parts.append(f"author: {rich_meta['author']}")
+                    if rich_meta.get("keywords"):
+                        keywords_str = ", ".join(rich_meta["keywords"])
+                        frontmatter_parts.append(f"keywords: [{keywords_str}]")
+                    if rich_meta.get("image"):
+                        frontmatter_parts.append(f"image: {rich_meta['image']}")
+                    if rich_meta.get("type"):
+                        frontmatter_parts.append(f"type: {rich_meta['type']}")
+                    if rich_meta.get("site_name"):
+                        frontmatter_parts.append(f"site_name: {rich_meta['site_name']}")
+                    if rich_meta.get("published_time"):
+                        frontmatter_parts.append(f"published_time: {rich_meta['published_time']}")
+                    if rich_meta.get("modified_time"):
+                        frontmatter_parts.append(f"modified_time: {rich_meta['modified_time']}")
+                    if rich_meta.get("section"):
+                        frontmatter_parts.append(f"section: {rich_meta['section']}")
+                    if rich_meta.get("tags"):
+                        tags_str = ", ".join(rich_meta["tags"])
+                        frontmatter_parts.append(f"tags: [{tags_str}]")
+                frontmatter_parts.append("---")
+                frontmatter_parts.append("")  # Empty line after frontmatter
+                frontmatter = "\n".join(frontmatter_parts)
                 return frontmatter + markdown.strip()
             else:
                 return f"# Error\n\nCould not find main content for {url}"
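
The hunk above writes metadata values into the frontmatter as plain YAML scalars and escapes colons in the description with a backslash. YAML plain scalars do not treat a backslash as an escape character, and a value containing `: ` can be misread as a nested mapping, so quoting is one common alternative. The sketch below shows that alternative and is not what the released code does: it serializes each value with `json.dumps`, relying on JSON strings and arrays being valid YAML flow syntax. The function name `build_frontmatter` and the `FIELDS` tuple are hypothetical; the field names are taken from the diff.

```python
import json
import time
from typing import Optional

# Field order follows the frontmatter-building code in the diff.
FIELDS = (
    "title", "description", "author", "keywords", "image", "type",
    "site_name", "published_time", "modified_time", "section", "tags",
)


def build_frontmatter(url: str, rich_meta: Optional[dict] = None) -> str:
    """Build a YAML frontmatter block, quoting every rich-metadata value."""
    lines = ["---", f"url: {url}", f"fetched: {time.strftime('%Y-%m-%d')}"]
    for key in FIELDS:
        value = (rich_meta or {}).get(key)
        if value:
            # json.dumps yields a double-quoted string or a [..] list,
            # both of which YAML parsers accept as flow-style values.
            lines.append(f"{key}: {json.dumps(value)}")
    lines.extend(["---", ""])  # trailing "" puts a newline after the closing ---
    return "\n".join(lines)


print(build_frontmatter(
    "https://docs.example.com/guide",
    {"title": "Guide: the basics", "keywords": ["tutorial", "api"]},
))
```

The trade-off is that every value appears quoted in the output, which is slightly noisier than the unquoted style shown in the README examples.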

{docpull-1.2.1 → docpull-1.3.0}/docpull/fetchers/generic_async.py RENAMED

@@ -38,6 +38,7 @@ class GenericAsyncFetcher(BaseFetcher):
         max_concurrent: int = 10,
         use_js: bool = False,
         show_progress: bool = True,
+        use_rich_metadata: bool = False,
     ) -> None:
         """
         Initialize async generic fetcher.
@@ -54,8 +55,15 @@ class GenericAsyncFetcher(BaseFetcher):
             max_concurrent: Maximum concurrent requests
             use_js: Enable JavaScript rendering (requires playwright)
             show_progress: Show progress bars
+            use_rich_metadata: Extract rich structured metadata (Open Graph, JSON-LD)
         """
-        super().__init__(output_dir, rate_limit, skip_existing=skip_existing, logger=logger)
+        super().__init__(
+            output_dir,
+            rate_limit,
+            skip_existing=skip_existing,
+            logger=logger,
+            use_rich_metadata=use_rich_metadata,
+        )
         # Determine if input is a URL or profile name
         if url_or_profile.startswith(("http://", "https://")):
