PyPI - spiderforce4ai - Versions diffs - 1.1__tar.gz → 1.3__tar.gz - Mend

spiderforce4ai 1.1tar.gz → 1.3tar.gz

Files changed (14)

spiderforce4ai-1.3/PKG-INFO ADDED

@@ -0,0 +1,298 @@
+Metadata-Version: 2.2
+Name: spiderforce4ai
+Version: 1.3
+Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
+Home-page: https://petertam.pro
+Author: Piotr Tamulewicz
+Author-email: Piotr Tamulewicz <pt@petertam.pro>
+License: MIT
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+Requires-Dist: aiohttp>=3.8.0
+Requires-Dist: asyncio>=3.4.3
+Requires-Dist: rich>=10.0.0
+Requires-Dist: aiofiles>=0.8.0
+Requires-Dist: httpx>=0.24.0
+Dynamic: author
+Dynamic: home-page
+Dynamic: requires-python
+# SpiderForce4AI Python Wrapper
+A Python package for web content crawling and HTML-to-Markdown conversion. Built for seamless integration with the SpiderForce4AI service.
+## Features
+- HTML to Markdown conversion
+- Parallel and async crawling support
+- Sitemap processing
+- Custom content selection
+- Automatic retry mechanism
+- Detailed progress tracking
+- Webhook notifications
+- Customizable reporting
+## Installation
+```bash
+pip install spiderforce4ai
+```
+## Quick Start
+```python
+from spiderforce4ai import SpiderForce4AI, CrawlConfig
+from pathlib import Path
+# Initialize crawler
+spider = SpiderForce4AI("http://localhost:3004")
+# Configure crawling options
+config = CrawlConfig(
+    target_selector="article",
+    remove_selectors=[".ads", ".navigation"],
+    max_concurrent_requests=5,
+    save_reports=True
+)
+# Crawl a sitemap
+results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
+```
+## Key Features
+### 1. Smart Retry Mechanism
+- Automatically retries failed URLs
+- Monitors failure ratio to prevent server overload
+- Detailed retry statistics and progress tracking
+- Aborts retries if failure rate exceeds 20%
+```python
+# Retry behavior is automatic
+config = CrawlConfig(
+    max_concurrent_requests=5,
+    request_delay=1.0  # Delay between retries
+)
+results = asyncio.run(spider.crawl_urls_async(urls, config))  # coroutine: run it in an event loop (needs "import asyncio")
+```
+### 2. Custom Webhook Integration
+- Flexible payload formatting
+- Custom headers support
+- Variable substitution in templates
+```python
+config = CrawlConfig(
+    webhook_url="https://your-webhook.com",
+    webhook_headers={
+        "Authorization": "Bearer token",
+        "X-Custom-Header": "value"
+    },
+    webhook_payload_template='''{
+        "url": "{url}",
+        "content": "{markdown}",
+        "status": "{status}",
+        "custom_field": "value"
+    }'''
+)
+```
+### 3. Flexible Report Generation
+- Optional report saving
+- Customizable report location
+- Detailed success/failure statistics
+```python
+config = CrawlConfig(
+    save_reports=True,
+    report_file=Path("custom_report.json"),
+    output_dir=Path("content")
+)
+```
+## Crawling Methods
+### 1. Single URL Processing
+```python
+# Synchronous
+result = spider.crawl_url("https://example.com", config)
+# Asynchronous
+async def crawl():
+    result = await spider.crawl_url_async("https://example.com", config)
+```
+### 2. Multiple URLs
+```python
+urls = ["https://example.com/page1", "https://example.com/page2"]
+# Server-side parallel (recommended)
+results = spider.crawl_urls_server_parallel(urls, config)
+# Client-side parallel
+results = spider.crawl_urls_parallel(urls, config)
+# Asynchronous
+async def crawl():
+    results = await spider.crawl_urls_async(urls, config)
+```
+### 3. Sitemap Processing
+```python
+# Server-side parallel (recommended)
+results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
+# Client-side parallel
+results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
+# Asynchronous
+async def crawl():
+    results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
+```
+## Configuration Options
+```python
+config = CrawlConfig(
+    # Content Selection
+    target_selector="article",              # Target element to extract
+    remove_selectors=[".ads", "#popup"],    # Elements to remove
+    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
+    # Processing
+    max_concurrent_requests=5,              # Parallel processing limit
+    request_delay=0.5,                      # Delay between requests
+    timeout=30,                             # Request timeout
+    # Output
+    output_dir=Path("content"),             # Output directory
+    save_reports=False,                     # Enable/disable report saving
+    report_file=Path("report.json"),        # Report location
+    # Webhook
+    webhook_url="https://webhook.com",      # Webhook endpoint
+    webhook_timeout=10,                     # Webhook timeout
+    webhook_headers={                       # Custom headers
+        "Authorization": "Bearer token"
+    },
+    # Custom payload format (a comment placed after ''' would become part of the string)
+    webhook_payload_template='''{
+        "url": "{url}",
+        "content": "{markdown}",
+        "status": "{status}",
+        "error": "{error}",
+        "time": "{timestamp}"
+    }'''
+)
+```
+## Progress Tracking
+The package provides detailed progress information:
+```
+Fetching sitemap from https://example.com/sitemap.xml...
+Found 156 URLs in sitemap
+[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs
+Retrying failed URLs: 18 (11.5% failed)
+[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries
+Crawling Summary:
+Total URLs processed: 156
+Initial failures: 18 (11.5%)
+Final results:
+  ✓ Successful: 150
+  ✗ Failed: 6
+Retry success rate: 12/18 (66.7%)
+```
+## Output Structure
+### 1. Directory Layout
+```
+content/                    # Output directory
+├── example-com-page1.md   # Markdown files
+├── example-com-page2.md
+└── report.json            # Crawl report
+```
+### 2. Report Format
+```json
+{
+  "timestamp": "2025-02-15T10:30:00",
+  "config": {
+    "target_selector": "article",
+    "remove_selectors": [".ads"]
+  },
+  "results": {
+    "successful": [...],
+    "failed": [...]
+  },
+  "summary": {
+    "total": 156,
+    "successful": 150,
+    "failed": 6
+  }
+}
+```
+## Performance Optimization
+1. Server-side Parallel Processing
+   - Recommended for most cases
+   - Single HTTP request
+   - Reduced network overhead
+   - Built-in load balancing
+2. Client-side Parallel Processing
+   - Better control over processing
+   - Customizable concurrency
+   - Progress tracking per URL
+   - Automatic retry handling
+3. Asynchronous Processing
+   - Ideal for async applications
+   - Non-blocking operation
+   - Real-time progress updates
+   - Efficient resource usage
+## Error Handling
+The package provides comprehensive error handling:
+- Automatic retry for failed URLs
+- Failure ratio monitoring
+- Detailed error reporting
+- Webhook error notifications
+- Progress tracking during retries
+## Requirements
+- Python 3.11+
+- Running SpiderForce4AI service
+- Internet connection
+## Dependencies
+- aiohttp
+- asyncio
+- rich
+- aiofiles
+- httpx
+## License
+MIT License
+## Credits
+Created by [Peter Tam](https://petertam.pro)
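
The README above says retries are aborted once the failure rate exceeds 20%. A minimal sketch of such a guard follows, under stated assumptions: `failure_ratio`, `should_retry`, and `max_failure_ratio` are hypothetical names chosen for illustration and are not confirmed to be part of the package. The figures come from the sample progress output (18 failures out of 156 URLs).

```python
def failure_ratio(failed: int, total: int) -> float:
    """Return the percentage of failed URLs (0.0 when nothing was crawled)."""
    return (failed / total) * 100 if total else 0.0


def should_retry(failed: int, total: int, max_failure_ratio: float = 20.0) -> bool:
    """Retry only when some URLs failed and the failure ratio is within the limit."""
    if failed == 0:
        return False
    return failure_ratio(failed, total) <= max_failure_ratio


# Figures from the sample progress output: 18 of 156 URLs failed.
print(f"{failure_ratio(18, 156):.1f}% failed, retry: {should_retry(18, 156)}")
```

With a higher failure count, for example 40 of 156, the ratio passes the limit and the guard returns `False`, which is the situation the README describes as protecting the server from overload.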

spiderforce4ai-1.3/README.md ADDED

@@ -0,0 +1,274 @@
+# SpiderForce4AI Python Wrapper
+A Python package for web content crawling and HTML-to-Markdown conversion. Built for seamless integration with the SpiderForce4AI service.
+## Features
+- HTML to Markdown conversion
+- Parallel and async crawling support
+- Sitemap processing
+- Custom content selection
+- Automatic retry mechanism
+- Detailed progress tracking
+- Webhook notifications
+- Customizable reporting
+## Installation
+```bash
+pip install spiderforce4ai
+```
+## Quick Start
+```python
+from spiderforce4ai import SpiderForce4AI, CrawlConfig
+from pathlib import Path
+# Initialize crawler
+spider = SpiderForce4AI("http://localhost:3004")
+# Configure crawling options
+config = CrawlConfig(
+    target_selector="article",
+    remove_selectors=[".ads", ".navigation"],
+    max_concurrent_requests=5,
+    save_reports=True
+)
+# Crawl a sitemap
+results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
+```
+## Key Features
+### 1. Smart Retry Mechanism
+- Automatically retries failed URLs
+- Monitors failure ratio to prevent server overload
+- Detailed retry statistics and progress tracking
+- Aborts retries if failure rate exceeds 20%
+```python
+# Retry behavior is automatic
+config = CrawlConfig(
+    max_concurrent_requests=5,
+    request_delay=1.0  # Delay between retries
+)
+results = asyncio.run(spider.crawl_urls_async(urls, config))  # coroutine: run it in an event loop (needs "import asyncio")
+```
+### 2. Custom Webhook Integration
+- Flexible payload formatting
+- Custom headers support
+- Variable substitution in templates
+```python
+config = CrawlConfig(
+    webhook_url="https://your-webhook.com",
+    webhook_headers={
+        "Authorization": "Bearer token",
+        "X-Custom-Header": "value"
+    },
+    webhook_payload_template='''{
+        "url": "{url}",
+        "content": "{markdown}",
+        "status": "{status}",
+        "custom_field": "value"
+    }'''
+)
+```
+### 3. Flexible Report Generation
+- Optional report saving
+- Customizable report location
+- Detailed success/failure statistics
+```python
+config = CrawlConfig(
+    save_reports=True,
+    report_file=Path("custom_report.json"),
+    output_dir=Path("content")
+)
+```
+## Crawling Methods
+### 1. Single URL Processing
+```python
+# Synchronous
+result = spider.crawl_url("https://example.com", config)
+# Asynchronous
+async def crawl():
+    result = await spider.crawl_url_async("https://example.com", config)
+```
+### 2. Multiple URLs
+```python
+urls = ["https://example.com/page1", "https://example.com/page2"]
+# Server-side parallel (recommended)
+results = spider.crawl_urls_server_parallel(urls, config)
+# Client-side parallel
+results = spider.crawl_urls_parallel(urls, config)
+# Asynchronous
+async def crawl():
+    results = await spider.crawl_urls_async(urls, config)
+```
+### 3. Sitemap Processing
+```python
+# Server-side parallel (recommended)
+results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
+# Client-side parallel
+results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
+# Asynchronous
+async def crawl():
+    results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
+```
+## Configuration Options
+```python
+config = CrawlConfig(
+    # Content Selection
+    target_selector="article",              # Target element to extract
+    remove_selectors=[".ads", "#popup"],    # Elements to remove
+    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
+    # Processing
+    max_concurrent_requests=5,              # Parallel processing limit
+    request_delay=0.5,                      # Delay between requests
+    timeout=30,                             # Request timeout
+    # Output
+    output_dir=Path("content"),             # Output directory
+    save_reports=False,                     # Enable/disable report saving
+    report_file=Path("report.json"),        # Report location
+    # Webhook
+    webhook_url="https://webhook.com",      # Webhook endpoint
+    webhook_timeout=10,                     # Webhook timeout
+    webhook_headers={                       # Custom headers
+        "Authorization": "Bearer token"
+    },
+    # Custom payload format (a comment placed after ''' would become part of the string)
+    webhook_payload_template='''{
+        "url": "{url}",
+        "content": "{markdown}",
+        "status": "{status}",
+        "error": "{error}",
+        "time": "{timestamp}"
+    }'''
+)
+```
+## Progress Tracking
+The package provides detailed progress information:
+```
+Fetching sitemap from https://example.com/sitemap.xml...
+Found 156 URLs in sitemap
+[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs
+Retrying failed URLs: 18 (11.5% failed)
+[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries
+Crawling Summary:
+Total URLs processed: 156
+Initial failures: 18 (11.5%)
+Final results:
+  ✓ Successful: 150
+  ✗ Failed: 6
+Retry success rate: 12/18 (66.7%)
+```
+## Output Structure
+### 1. Directory Layout
+```
+content/                    # Output directory
+├── example-com-page1.md   # Markdown files
+├── example-com-page2.md
+└── report.json            # Crawl report
+```
+### 2. Report Format
+```json
+{
+  "timestamp": "2025-02-15T10:30:00",
+  "config": {
+    "target_selector": "article",
+    "remove_selectors": [".ads"]
+  },
+  "results": {
+    "successful": [...],
+    "failed": [...]
+  },
+  "summary": {
+    "total": 156,
+    "successful": 150,
+    "failed": 6
+  }
+}
+```
+## Performance Optimization
+1. Server-side Parallel Processing
+   - Recommended for most cases
+   - Single HTTP request
+   - Reduced network overhead
+   - Built-in load balancing
+2. Client-side Parallel Processing
+   - Better control over processing
+   - Customizable concurrency
+   - Progress tracking per URL
+   - Automatic retry handling
+3. Asynchronous Processing
+   - Ideal for async applications
+   - Non-blocking operation
+   - Real-time progress updates
+   - Efficient resource usage
+## Error Handling
+The package provides comprehensive error handling:
+- Automatic retry for failed URLs
+- Failure ratio monitoring
+- Detailed error reporting
+- Webhook error notifications
+- Progress tracking during retries
+## Requirements
+- Python 3.11+
+- Running SpiderForce4AI service
+- Internet connection
+## Dependencies
+- aiohttp
+- asyncio
+- rich
+- aiofiles
+- httpx
+## License
+MIT License
+## Credits
+Created by [Peter Tam](https://petertam.pro)
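
The webhook templates in this README place `{url}`-style placeholders inside JSON text that also contains literal braces, so `str.format` cannot be applied to them directly. The sketch below shows one way such a template could be rendered, assuming plain token replacement with JSON escaping of each value. `render_payload` is a hypothetical helper and this is not confirmed to be how the package performs the substitution.

```python
import json

TEMPLATE = '''{
    "url": "{url}",
    "content": "{markdown}",
    "status": "{status}",
    "custom_field": "value"
}'''


def render_payload(template: str, values: dict[str, str]) -> dict:
    """Substitute {name} tokens with JSON-escaped values, then parse the result."""
    rendered = template
    for name, value in values.items():
        # json.dumps escapes quotes and newlines; [1:-1] drops the outer quotes
        # because the template already supplies them.
        rendered = rendered.replace("{" + name + "}", json.dumps(value)[1:-1])
    return json.loads(rendered)


payload = render_payload(
    TEMPLATE,
    {"url": "https://example.com/page1", "markdown": '# Title\n"quoted"', "status": "success"},
)
print(payload["content"])
```

Escaping matters because crawled Markdown routinely contains quotes and newlines, which would otherwise produce invalid JSON. One limitation of this simple approach: a value that itself contains a token such as `{status}` would be rewritten by a later replacement.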

{spiderforce4ai-1.1 → spiderforce4ai-1.3}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "spiderforce4ai"
-version = "1.1"
+version = "1.3"
 description = "Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service"
 readme = "README.md"
 authors = [{name = "Piotr Tamulewicz", email = "pt@petertam.pro"}]

{spiderforce4ai-1.1 → spiderforce4ai-1.3}/setup.py RENAMED

@@ -3,7 +3,7 @@ from setuptools import setup, find_packages
 setup(
     name="spiderforce4ai",
-    version="1.1",
+    version="1.3",
     author="Piotr Tamulewicz",
     author_email="pt@petertam.pro",
     description="Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service",

{spiderforce4ai-1.1 → spiderforce4ai-1.3}/spiderforce4ai/__init__.py RENAMED

@@ -350,17 +350,23 @@ class SpiderForce4AI:
     def _save_report_sync(self, results: List[CrawlResult], config: CrawlConfig) -> None:
         """Save crawl report synchronously."""
+        # Separate successful and failed results
+        successful_results = [r for r in results if r.status == "success"]
+        failed_results = [r for r in results if r.status == "failed"]
+        # Create report with only final state
         report = {
             "timestamp": datetime.now().isoformat(),
             "config": config.to_dict(),
             "results": {
-                "successful": [asdict(r) for r in results if r.status == "success"],
-                "failed": [asdict(r) for r in results if r.status == "failed"]
+                "successful": [asdict(r) for r in successful_results],
+                "failed": [asdict(r) for r in failed_results]  # Only truly failed URLs after retries
             },
             "summary": {
                 "total": len(results),
-                "successful": len([r for r in results if r.status == "success"]),
-                "failed": len([r for r in results if r.status == "failed"])
+                "successful": len(successful_results),
+                "failed": len(failed_results),
+                "retry_info": getattr(self, '_retry_stats', {})  # Include retry statistics if available
             }
         }
@@ -372,17 +378,22 @@ class SpiderForce4AI:
         if not config.report_file:
             return
+        # Separate successful and failed results
+        successful_results = [r for r in self.crawl_results if r.status == "success"]
+        failed_results = [r for r in self.crawl_results if r.status == "failed"]
         report = {
             "timestamp": datetime.now().isoformat(),
             "config": config.to_dict(),
             "results": {
-                "successful": [asdict(r) for r in self.crawl_results if r.status == "success"],
-                "failed": [asdict(r) for r in self.crawl_results if r.status == "failed"]
+                "successful": [asdict(r) for r in successful_results],
+                "failed": [asdict(r) for r in failed_results]  # Only truly failed URLs after retries
             },
             "summary": {
                 "total": len(self.crawl_results),
-                "successful": len([r for r in self.crawl_results if r.status == "success"]),
-                "failed": len([r for r in self.crawl_results if r.status == "failed"])
+                "successful": len(successful_results),
+                "failed": len(failed_results),
+                "retry_info": getattr(self, '_retry_stats', {})  # Include retry statistics if available
             }
         }
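
The two report hunks above make the same change: results are partitioned once, the lists and the counts are both derived from that partition, and a `retry_info` entry falls back to an empty dict when no retry statistics exist. A self-contained sketch of that shape follows. `Result` is a stand-in for the package's `CrawlResult` with only a few assumed fields, `build_report` is a hypothetical function, the `config` entry is left out, and the `retry_stats` keys are illustrative only.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Result:
    """Stand-in for the package's CrawlResult; only these fields are assumed."""
    url: str
    status: str
    error: Optional[str] = None


def build_report(results: list[Result], retry_stats: Optional[dict] = None) -> dict:
    """Partition results once and derive both the lists and the counts from them."""
    successful = [r for r in results if r.status == "success"]
    failed = [r for r in results if r.status == "failed"]
    return {
        "timestamp": datetime.now().isoformat(),
        "results": {
            "successful": [asdict(r) for r in successful],
            "failed": [asdict(r) for r in failed],
        },
        "summary": {
            "total": len(results),
            "successful": len(successful),
            "failed": len(failed),
            "retry_info": retry_stats or {},  # empty when no retries were recorded
        },
    }


report = build_report(
    [Result("https://example.com/page1", "success"),
     Result("https://example.com/page2", "failed", "timeout")],
    retry_stats={"retried": 1, "recovered": 0},  # illustrative keys only
)
print(report["summary"])
```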
@@ -535,8 +546,13 @@ class SpiderForce4AI:
                     results = initial_results
                 else:
                     retry_results = await self._retry_failed_urls(failed_results, config, progress)
-                    # Replace failed results with retry results
-                    results = [r for r in initial_results if r.status == "success"] + retry_results
+                    # Update results list by replacing failed results with successful retries
+                    results = initial_results.copy()
+                    for retry_result in retry_results:
+                        for i, result in enumerate(results):
+                            if result.url == retry_result.url:
+                                results[i] = retry_result
+                                break
             else:
                 results = initial_results
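
The hunk above changes how retry results are merged: instead of concatenating the initial successes with the retry results, which reorders the list, each retried URL now replaces its original entry in place. The same merge can be sketched with a URL-to-index map, which avoids rescanning the list for every retry. `Result` and `merge_retries` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Minimal stand-in for a crawl result; only url and status are assumed."""
    url: str
    status: str


def merge_retries(initial: list[Result], retries: list[Result]) -> list[Result]:
    """Return a copy of `initial` with each retried URL replaced in place."""
    merged = list(initial)
    # Map each URL to its first position, mirroring the `break` in the nested loop.
    index_by_url: dict[str, int] = {}
    for i, result in enumerate(merged):
        index_by_url.setdefault(result.url, i)
    for retry in retries:
        i = index_by_url.get(retry.url)
        if i is not None:
            merged[i] = retry
    return merged


initial = [Result("a", "success"), Result("b", "failed"), Result("c", "failed")]
retries = [Result("c", "success"), Result("b", "failed")]
merged = merge_retries(initial, retries)
print([(r.url, r.status) for r in merged])
```

The original order of URLs is preserved and the input list is left untouched, which matches the `initial_results.copy()` call in the diff.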
@@ -661,12 +677,27 @@ class SpiderForce4AI:
                 console.print(f"\n[yellow]Retrying failed URLs: {failed_count} ({failure_ratio:.1f}% failed)[/yellow]")
                 for result in failed_results:
                     new_result = _process_url_parallel((result.url, self.base_url, config))
+                    # Save markdown and trigger webhook for successful retries
                     if new_result.status == "success":
                         console.print(f"[green]✓ Retry successful: {result.url}[/green]")
-                        # Replace the failed result with the successful retry
-                        results[results.index(result)] = new_result
+                        # Save markdown if output directory is configured
+                        if config.output_dir and new_result.markdown:
+                            filepath = config.output_dir / f"{slugify(new_result.url)}.md"
+                            with open(filepath, 'w', encoding='utf-8') as f:
+                                f.write(new_result.markdown)
+                        # Send webhook for successful retry
+                        _send_webhook_sync(new_result, config)
                     else:
                         console.print(f"[red]✗ Retry failed: {result.url} - {new_result.error}[/red]")
+                        # Send webhook for failed retry
+                        _send_webhook_sync(new_result, config)
+                    # Update results list
+                    for i, r in enumerate(results):
+                        if r.url == new_result.url:
+                            results[i] = new_result
+                            break
         # Calculate final statistics
         final_successful = len([r for r in results if r.status == "success"])
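
In the hunk above, a successful retry now writes its Markdown to `config.output_dir` under a slugified file name before the webhook is sent. The sketch below isolates the file-writing step. It rests on assumptions: `url_to_slug` is a hypothetical stand-in for the package's `slugify`, whose real behavior is not shown in this diff, and it was written only to reproduce the `example-com-page1.md` name from the README's directory layout. Creating the directory first is an addition of this sketch, not something the diff shows.

```python
import re
import tempfile
from pathlib import Path


def url_to_slug(url: str) -> str:
    """Hypothetical stand-in for slugify: drop the scheme, turn other runs into '-'."""
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    return re.sub(r"[^a-zA-Z0-9]+", "-", without_scheme).strip("-").lower()


def save_markdown(output_dir: Path, url: str, markdown: str) -> Path:
    """Write the Markdown for one URL and return the path that was written."""
    output_dir.mkdir(parents=True, exist_ok=True)  # assumption: directory may not exist yet
    filepath = output_dir / f"{url_to_slug(url)}.md"
    filepath.write_text(markdown, encoding="utf-8")
    return filepath


with tempfile.TemporaryDirectory() as tmp:
    saved = save_markdown(Path(tmp) / "content", "https://example.com/page1", "# Page 1\n")
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
print(saved_name)
```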
