PyPI - spiderforce4ai - Versions diffs - 0.1.7__tar.gz → 0.1.8__tar.gz - Mend

spiderforce4ai 0.1.7tar.gz → 0.1.8tar.gz

Files changed (11) hide show

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: spiderforce4ai
-Version: 0.1.7
+Version: 0.1.8
 Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
 Home-page: https://petertam.pro
 Author: Piotr Tamulewicz
@@ -24,75 +24,73 @@ Dynamic: requires-python
 # SpiderForce4AI Python Wrapper
-A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
-## Installation
-```bash
-pip install spiderforce4ai
-```
+A Python package for web content crawling and HTML-to-Markdown conversion. Built for seamless integration with SpiderForce4AI service.
 ## Quick Start (Minimal Setup)
 ```python
 from spiderforce4ai import SpiderForce4AI, CrawlConfig
-# Initialize with your SpiderForce4AI service URL
+# Initialize with your service URL
 spider = SpiderForce4AI("http://localhost:3004")
-# Use default configuration (will save in ./spiderforce_reports)
+# Create default config
 config = CrawlConfig()
 # Crawl a single URL
 result = spider.crawl_url("https://example.com", config)
 ```
+## Installation
+```bash
+pip install spiderforce4ai
+```
 ## Crawling Methods
-### 1. Single URL Crawling
+### 1. Single URL
 ```python
-# Synchronous
+# Basic usage
 result = spider.crawl_url("https://example.com", config)
-# Asynchronous
+# Async version
 async def crawl():
     result = await spider.crawl_url_async("https://example.com", config)
 ```
-### 2. Multiple URLs Crawling
+### 2. Multiple URLs
 ```python
-# List of URLs
 urls = [
     "https://example.com/page1",
-    "https://example.com/page2",
-    "https://example.com/page3"
+    "https://example.com/page2"
 ]
-# Synchronous
-results = spider.crawl_urls(urls, config)
+# Client-side parallel (using multiprocessing)
+results = spider.crawl_urls_parallel(urls, config)
+# Server-side parallel (single request)
+results = spider.crawl_urls_server_parallel(urls, config)
-# Asynchronous
+# Async version
 async def crawl():
     results = await spider.crawl_urls_async(urls, config)
-# Parallel (using multiprocessing)
-results = spider.crawl_urls_parallel(urls, config)
 ```
 ### 3. Sitemap Crawling
 ```python
-# Synchronous
-results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
+# Server-side parallel (recommended)
+results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
+# Client-side parallel
+results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
-# Asynchronous
+# Async version
 async def crawl():
     results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
-# Parallel (using multiprocessing)
-results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
 ```
 ## Configuration Options
@@ -100,9 +98,11 @@ results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", confi
 All configuration options are optional with sensible defaults:
 ```python
+from pathlib import Path
 config = CrawlConfig(
     # Content Selection (all optional)
-    target_selector="article",              # Specific element to target
+    target_selector="article",              # Specific element to extract
     remove_selectors=[                      # Elements to remove
         ".ads",
         "#popup",
@@ -112,21 +112,21 @@ config = CrawlConfig(
     remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
     # Processing Settings
-    max_concurrent_requests=1,              # Default: 1 (parallel processing)
-    request_delay=0.5,                     # Delay between requests in seconds
-    timeout=30,                            # Request timeout in seconds
+    max_concurrent_requests=1,              # For client-side parallel processing
+    request_delay=0.5,                     # Delay between requests (seconds)
+    timeout=30,                            # Request timeout (seconds)
     # Output Settings
-    output_dir="custom_output",            # Default: "spiderforce_reports"
-    report_file="custom_report.json",      # Default: "crawl_report.json"
-    webhook_url="https://your-webhook.com", # Optional webhook endpoint
-    webhook_timeout=10                      # Webhook timeout in seconds
+    output_dir=Path("spiderforce_reports"), # Default directory for files
+    webhook_url="https://your-webhook.com", # Real-time notifications
+    webhook_timeout=10,                     # Webhook timeout
+    report_file=Path("crawl_report.json")   # Final report location
 )
 ```
 ## Real-World Examples
-### 1. Basic Website Crawling
+### 1. Basic Blog Crawling
 ```python
 from spiderforce4ai import SpiderForce4AI, CrawlConfig
@@ -134,78 +134,77 @@ from pathlib import Path
 spider = SpiderForce4AI("http://localhost:3004")
 config = CrawlConfig(
+    target_selector="article.post-content",
     output_dir=Path("blog_content")
 )
-result = spider.crawl_url("https://example.com/blog", config)
-print(f"Content saved to: {result.url}.md")
+result = spider.crawl_url("https://example.com/blog-post", config)
 ```
-### 2. Advanced Parallel Sitemap Crawling
+### 2. Parallel Website Crawling
 ```python
 config = CrawlConfig(
-    max_concurrent_requests=5,
-    output_dir=Path("website_content"),
     remove_selectors=[
         ".navigation",
         ".footer",
         ".ads",
         "#cookie-notice"
     ],
+    max_concurrent_requests=5,
+    output_dir=Path("website_content"),
     webhook_url="https://your-webhook.com/endpoint"
 )
-results = spider.crawl_sitemap_parallel(
-    "https://example.com/sitemap.xml",
-    config
-)
+# Using server-side parallel processing
+results = spider.crawl_urls_server_parallel([
+    "https://example.com/page1",
+    "https://example.com/page2",
+    "https://example.com/page3"
+], config)
 ```
-### 3. Async Crawling with Progress
+### 3. Full Sitemap Processing
 ```python
-import asyncio
-async def main():
-    config = CrawlConfig(
-        max_concurrent_requests=3,
-        request_delay=1.0
-    )
-    async with spider:
-        results = await spider.crawl_urls_async([
-            "https://example.com/1",
-            "https://example.com/2",
-            "https://example.com/3"
-        ], config)
-    return results
+config = CrawlConfig(
+    target_selector="main",
+    remove_selectors=[".sidebar", ".comments"],
+    output_dir=Path("site_content"),
+    report_file=Path("crawl_report.json")
+)
-results = asyncio.run(main())
+results = spider.crawl_sitemap_server_parallel(
+    "https://example.com/sitemap.xml",
+    config
+)
 ```
 ## Output Structure
-### 1. File Organization
+### 1. Directory Layout
 ```
-output_dir/
-├── example-com-page1.md
+spiderforce_reports/           # Default output directory
+├── example-com-page1.md      # Converted markdown files
 ├── example-com-page2.md
-└── crawl_report.json
+└── crawl_report.json         # Crawl report
 ```
 ### 2. Markdown Files
-Each markdown file is named using a slugified version of the URL and contains the converted content.
+Each file is named using a slugified version of the URL:
+```markdown
+# Page Title
+Content converted to clean markdown...
+```
-### 3. Report JSON Structure
+### 3. Crawl Report
 ```json
 {
   "timestamp": "2025-02-15T10:30:00.123456",
   "config": {
     "target_selector": "article",
-    "remove_selectors": [".ads", "#popup"],
-    "remove_selectors_regex": ["modal-\\d+"]
+    "remove_selectors": [".ads", "#popup"]
   },
   "results": {
     "successful": [
@@ -234,7 +233,7 @@ Each markdown file is named using a slugified version of the URL and contains th
 ```
 ### 4. Webhook Notifications
-If configured, webhooks receive real-time updates in JSON format:
+If configured, real-time updates are sent for each processed URL:
 ```json
 {
   "url": "https://example.com/page1",
@@ -250,7 +249,7 @@ If configured, webhooks receive real-time updates in JSON format:
 ## Error Handling
-The package handles various types of errors:
+The package handles various types of errors gracefully:
 - Network errors
 - Timeout errors
 - Invalid URLs
@@ -269,6 +268,25 @@ All errors are:
 - Running SpiderForce4AI service
 - Internet connection
+## Performance Considerations
+1. Server-side Parallel Processing
+   - Best for most cases
+   - Single HTTP request for multiple URLs
+   - Less network overhead
+   - Use: `crawl_urls_server_parallel()` or `crawl_sitemap_server_parallel()`
+2. Client-side Parallel Processing
+   - Good for special cases requiring local control
+   - Uses Python multiprocessing
+   - More network overhead
+   - Use: `crawl_urls_parallel()` or `crawl_sitemap_parallel()`
+3. Async Processing
+   - Best for integration with async applications
+   - Good for real-time processing
+   - Use: `crawl_url_async()`, `crawl_urls_async()`, or `crawl_sitemap_async()`
 ## License
 MIT License

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/README.md RENAMED Viewed

@@ -1,74 +1,72 @@
 # SpiderForce4AI Python Wrapper
-A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
-## Installation
-```bash
-pip install spiderforce4ai
-```
+A Python package for web content crawling and HTML-to-Markdown conversion. Built for seamless integration with SpiderForce4AI service.
 ## Quick Start (Minimal Setup)
 ```python
 from spiderforce4ai import SpiderForce4AI, CrawlConfig
-# Initialize with your SpiderForce4AI service URL
+# Initialize with your service URL
 spider = SpiderForce4AI("http://localhost:3004")
-# Use default configuration (will save in ./spiderforce_reports)
+# Create default config
 config = CrawlConfig()
 # Crawl a single URL
 result = spider.crawl_url("https://example.com", config)
 ```
+## Installation
+```bash
+pip install spiderforce4ai
+```
 ## Crawling Methods
-### 1. Single URL Crawling
+### 1. Single URL
 ```python
-# Synchronous
+# Basic usage
 result = spider.crawl_url("https://example.com", config)
-# Asynchronous
+# Async version
 async def crawl():
     result = await spider.crawl_url_async("https://example.com", config)
 ```
-### 2. Multiple URLs Crawling
+### 2. Multiple URLs
 ```python
-# List of URLs
 urls = [
     "https://example.com/page1",
-    "https://example.com/page2",
-    "https://example.com/page3"
+    "https://example.com/page2"
 ]
-# Synchronous
-results = spider.crawl_urls(urls, config)
+# Client-side parallel (using multiprocessing)
+results = spider.crawl_urls_parallel(urls, config)
+# Server-side parallel (single request)
+results = spider.crawl_urls_server_parallel(urls, config)
-# Asynchronous
+# Async version
 async def crawl():
     results = await spider.crawl_urls_async(urls, config)
-# Parallel (using multiprocessing)
-results = spider.crawl_urls_parallel(urls, config)
 ```
 ### 3. Sitemap Crawling
 ```python
-# Synchronous
-results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
+# Server-side parallel (recommended)
+results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
+# Client-side parallel
+results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
-# Asynchronous
+# Async version
 async def crawl():
     results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
-# Parallel (using multiprocessing)
-results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
 ```
 ## Configuration Options
@@ -76,9 +74,11 @@ results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", confi
 All configuration options are optional with sensible defaults:
 ```python
+from pathlib import Path
 config = CrawlConfig(
     # Content Selection (all optional)
-    target_selector="article",              # Specific element to target
+    target_selector="article",              # Specific element to extract
     remove_selectors=[                      # Elements to remove
         ".ads",
         "#popup",
@@ -88,21 +88,21 @@ config = CrawlConfig(
     remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
     # Processing Settings
-    max_concurrent_requests=1,              # Default: 1 (parallel processing)
-    request_delay=0.5,                     # Delay between requests in seconds
-    timeout=30,                            # Request timeout in seconds
+    max_concurrent_requests=1,              # For client-side parallel processing
+    request_delay=0.5,                     # Delay between requests (seconds)
+    timeout=30,                            # Request timeout (seconds)
     # Output Settings
-    output_dir="custom_output",            # Default: "spiderforce_reports"
-    report_file="custom_report.json",      # Default: "crawl_report.json"
-    webhook_url="https://your-webhook.com", # Optional webhook endpoint
-    webhook_timeout=10                      # Webhook timeout in seconds
+    output_dir=Path("spiderforce_reports"), # Default directory for files
+    webhook_url="https://your-webhook.com", # Real-time notifications
+    webhook_timeout=10,                     # Webhook timeout
+    report_file=Path("crawl_report.json")   # Final report location
 )
 ```
 ## Real-World Examples
-### 1. Basic Website Crawling
+### 1. Basic Blog Crawling
 ```python
 from spiderforce4ai import SpiderForce4AI, CrawlConfig
@@ -110,78 +110,77 @@ from pathlib import Path
 spider = SpiderForce4AI("http://localhost:3004")
 config = CrawlConfig(
+    target_selector="article.post-content",
     output_dir=Path("blog_content")
 )
-result = spider.crawl_url("https://example.com/blog", config)
-print(f"Content saved to: {result.url}.md")
+result = spider.crawl_url("https://example.com/blog-post", config)
 ```
-### 2. Advanced Parallel Sitemap Crawling
+### 2. Parallel Website Crawling
 ```python
 config = CrawlConfig(
-    max_concurrent_requests=5,
-    output_dir=Path("website_content"),
     remove_selectors=[
         ".navigation",
         ".footer",
         ".ads",
         "#cookie-notice"
     ],
+    max_concurrent_requests=5,
+    output_dir=Path("website_content"),
     webhook_url="https://your-webhook.com/endpoint"
 )
-results = spider.crawl_sitemap_parallel(
-    "https://example.com/sitemap.xml",
-    config
-)
+# Using server-side parallel processing
+results = spider.crawl_urls_server_parallel([
+    "https://example.com/page1",
+    "https://example.com/page2",
+    "https://example.com/page3"
+], config)
 ```
-### 3. Async Crawling with Progress
+### 3. Full Sitemap Processing
 ```python
-import asyncio
-async def main():
-    config = CrawlConfig(
-        max_concurrent_requests=3,
-        request_delay=1.0
-    )
-    async with spider:
-        results = await spider.crawl_urls_async([
-            "https://example.com/1",
-            "https://example.com/2",
-            "https://example.com/3"
-        ], config)
-    return results
+config = CrawlConfig(
+    target_selector="main",
+    remove_selectors=[".sidebar", ".comments"],
+    output_dir=Path("site_content"),
+    report_file=Path("crawl_report.json")
+)
-results = asyncio.run(main())
+results = spider.crawl_sitemap_server_parallel(
+    "https://example.com/sitemap.xml",
+    config
+)
 ```
 ## Output Structure
-### 1. File Organization
+### 1. Directory Layout
 ```
-output_dir/
-├── example-com-page1.md
+spiderforce_reports/           # Default output directory
+├── example-com-page1.md      # Converted markdown files
 ├── example-com-page2.md
-└── crawl_report.json
+└── crawl_report.json         # Crawl report
 ```
 ### 2. Markdown Files
-Each markdown file is named using a slugified version of the URL and contains the converted content.
+Each file is named using a slugified version of the URL:
+```markdown
+# Page Title
+Content converted to clean markdown...
+```
-### 3. Report JSON Structure
+### 3. Crawl Report
 ```json
 {
   "timestamp": "2025-02-15T10:30:00.123456",
   "config": {
     "target_selector": "article",
-    "remove_selectors": [".ads", "#popup"],
-    "remove_selectors_regex": ["modal-\\d+"]
+    "remove_selectors": [".ads", "#popup"]
   },
   "results": {
     "successful": [
@@ -210,7 +209,7 @@ Each markdown file is named using a slugified version of the URL and contains th
 ```
 ### 4. Webhook Notifications
-If configured, webhooks receive real-time updates in JSON format:
+If configured, real-time updates are sent for each processed URL:
 ```json
 {
   "url": "https://example.com/page1",
@@ -226,7 +225,7 @@ If configured, webhooks receive real-time updates in JSON format:
 ## Error Handling
-The package handles various types of errors:
+The package handles various types of errors gracefully:
 - Network errors
 - Timeout errors
 - Invalid URLs
@@ -245,6 +244,25 @@ All errors are:
 - Running SpiderForce4AI service
 - Internet connection
+## Performance Considerations
+1. Server-side Parallel Processing
+   - Best for most cases
+   - Single HTTP request for multiple URLs
+   - Less network overhead
+   - Use: `crawl_urls_server_parallel()` or `crawl_sitemap_server_parallel()`
+2. Client-side Parallel Processing
+   - Good for special cases requiring local control
+   - Uses Python multiprocessing
+   - More network overhead
+   - Use: `crawl_urls_parallel()` or `crawl_sitemap_parallel()`
+3. Async Processing
+   - Best for integration with async applications
+   - Good for real-time processing
+   - Use: `crawl_url_async()`, `crawl_urls_async()`, or `crawl_sitemap_async()`
 ## License
 MIT License

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/pyproject.toml RENAMED Viewed

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "spiderforce4ai"
-version = "0.1.7"
+version = "0.1.8"
 description = "Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service"
 readme = "README.md"
 authors = [{name = "Piotr Tamulewicz", email = "pt@petertam.pro"}]

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/setup.py RENAMED Viewed

@@ -3,7 +3,7 @@ from setuptools import setup, find_packages
 setup(
     name="spiderforce4ai",
-    version="0.1.7",
+    version="0.1.8",
     author="Piotr Tamulewicz",
     author_email="pt@petertam.pro",
     description="Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service",

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/spiderforce4ai/__init__.py RENAMED Viewed

@@ -196,6 +196,113 @@ class SpiderForce4AI:
             await f.write(markdown)
         return filepath
+    def crawl_sitemap_server_parallel(self, sitemap_url: str, config: CrawlConfig) -> List[CrawlResult]:
+        """
+        Crawl sitemap URLs using server-side parallel processing.
+        """
+        print(f"Fetching sitemap from {sitemap_url}...")
+        # Fetch sitemap
+        try:
+            response = requests.get(sitemap_url, timeout=config.timeout)
+            response.raise_for_status()
+            sitemap_text = response.text
+        except Exception as e:
+            print(f"Error fetching sitemap: {str(e)}")
+            raise
+        # Parse sitemap
+        try:
+            root = ET.fromstring(sitemap_text)
+            namespace = {'ns': root.tag.split('}')[0].strip('{')}
+            urls = [loc.text for loc in root.findall('.//ns:loc', namespace)]
+            print(f"Found {len(urls)} URLs in sitemap")
+        except Exception as e:
+            print(f"Error parsing sitemap: {str(e)}")
+            raise
+        # Process URLs using server-side parallel endpoint
+        return self.crawl_urls_server_parallel(urls, config)
+    def crawl_urls_server_parallel(self, urls: List[str], config: CrawlConfig) -> List[CrawlResult]:
+        """
+        Crawl multiple URLs using server-side parallel processing.
+        This uses the /convert_parallel endpoint which handles parallelization on the server.
+        """
+        print(f"Sending {len(urls)} URLs for parallel processing...")
+        try:
+            endpoint = f"{self.base_url}/convert_parallel"
+            # Prepare payload
+            payload = {
+                "urls": urls,
+                **config.to_dict()
+            }
+            # Send request
+            response = requests.post(
+                endpoint,
+                json=payload,
+                timeout=config.timeout
+            )
+            response.raise_for_status()
+            # Process results
+            results = []
+            server_results = response.json()  # Assuming server returns JSON array of results
+            for url_result in server_results:
+                result = CrawlResult(
+                    url=url_result["url"],
+                    status=url_result.get("status", "failed"),
+                    markdown=url_result.get("markdown"),
+                    error=url_result.get("error"),
+                    config=config.to_dict()
+                )
+                # Save markdown if successful and output dir is configured
+                if result.status == "success" and config.output_dir and result.markdown:
+                    filepath = config.output_dir / f"{slugify(result.url)}.md"
+                    with open(filepath, 'w', encoding='utf-8') as f:
+                        f.write(result.markdown)
+                # Send webhook if configured
+                if config.webhook_url:
+                    _send_webhook_sync(result, config)
+                results.append(result)
+            # Save report if configured
+            if config.report_file:
+                self._save_report_sync(results, config)
+                print(f"\nReport saved to: {config.report_file}")
+            # Print summary
+            successful = len([r for r in results if r.status == "success"])
+            failed = len([r for r in results if r.status == "failed"])
+            print(f"\nParallel processing completed:")
+            print(f"✓ Successful: {successful}")
+            print(f"✗ Failed: {failed}")
+            return results
+        except Exception as e:
+            print(f"Error during parallel processing: {str(e)}")
+            # Create failed results for all URLs
+            return [
+                CrawlResult(
+                    url=url,
+                    status="failed",
+                    error=str(e),
+                    config=config.to_dict()
+                ) for url in urls
+            ]
     async def _send_webhook(self, result: CrawlResult, config: CrawlConfig):
         """Send webhook with crawl results."""
         if not config.webhook_url:

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/spiderforce4ai.egg-info/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: spiderforce4ai
-Version: 0.1.7
+Version: 0.1.8
 Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
 Home-page: https://petertam.pro
 Author: Piotr Tamulewicz
@@ -24,75 +24,73 @@ Dynamic: requires-python
 # SpiderForce4AI Python Wrapper
-A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
-## Installation
-```bash
-pip install spiderforce4ai
-```
+A Python package for web content crawling and HTML-to-Markdown conversion. Built for seamless integration with SpiderForce4AI service.
 ## Quick Start (Minimal Setup)
 ```python
 from spiderforce4ai import SpiderForce4AI, CrawlConfig
-# Initialize with your SpiderForce4AI service URL
+# Initialize with your service URL
 spider = SpiderForce4AI("http://localhost:3004")
-# Use default configuration (will save in ./spiderforce_reports)
+# Create default config
 config = CrawlConfig()
 # Crawl a single URL
 result = spider.crawl_url("https://example.com", config)
 ```
+## Installation
+```bash
+pip install spiderforce4ai
+```
 ## Crawling Methods
-### 1. Single URL Crawling
+### 1. Single URL
 ```python
-# Synchronous
+# Basic usage
 result = spider.crawl_url("https://example.com", config)
-# Asynchronous
+# Async version
 async def crawl():
     result = await spider.crawl_url_async("https://example.com", config)
 ```
-### 2. Multiple URLs Crawling
+### 2. Multiple URLs
 ```python
-# List of URLs
 urls = [
     "https://example.com/page1",
-    "https://example.com/page2",
-    "https://example.com/page3"
+    "https://example.com/page2"
 ]
-# Synchronous
-results = spider.crawl_urls(urls, config)
+# Client-side parallel (using multiprocessing)
+results = spider.crawl_urls_parallel(urls, config)
+# Server-side parallel (single request)
+results = spider.crawl_urls_server_parallel(urls, config)
-# Asynchronous
+# Async version
 async def crawl():
     results = await spider.crawl_urls_async(urls, config)
-# Parallel (using multiprocessing)
-results = spider.crawl_urls_parallel(urls, config)
 ```
 ### 3. Sitemap Crawling
 ```python
-# Synchronous
-results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
+# Server-side parallel (recommended)
+results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
+# Client-side parallel
+results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
-# Asynchronous
+# Async version
 async def crawl():
     results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
-# Parallel (using multiprocessing)
-results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
 ```
 ## Configuration Options
@@ -100,9 +98,11 @@ results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", confi
 All configuration options are optional with sensible defaults:
 ```python
+from pathlib import Path
 config = CrawlConfig(
     # Content Selection (all optional)
-    target_selector="article",              # Specific element to target
+    target_selector="article",              # Specific element to extract
     remove_selectors=[                      # Elements to remove
         ".ads",
         "#popup",
@@ -112,21 +112,21 @@ config = CrawlConfig(
     remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
     # Processing Settings
-    max_concurrent_requests=1,              # Default: 1 (parallel processing)
-    request_delay=0.5,                     # Delay between requests in seconds
-    timeout=30,                            # Request timeout in seconds
+    max_concurrent_requests=1,              # For client-side parallel processing
+    request_delay=0.5,                     # Delay between requests (seconds)
+    timeout=30,                            # Request timeout (seconds)
     # Output Settings
-    output_dir="custom_output",            # Default: "spiderforce_reports"
-    report_file="custom_report.json",      # Default: "crawl_report.json"
-    webhook_url="https://your-webhook.com", # Optional webhook endpoint
-    webhook_timeout=10                      # Webhook timeout in seconds
+    output_dir=Path("spiderforce_reports"), # Default directory for files
+    webhook_url="https://your-webhook.com", # Real-time notifications
+    webhook_timeout=10,                     # Webhook timeout
+    report_file=Path("crawl_report.json")   # Final report location
 )
 ```
 ## Real-World Examples
-### 1. Basic Website Crawling
+### 1. Basic Blog Crawling
 ```python
 from spiderforce4ai import SpiderForce4AI, CrawlConfig
@@ -134,78 +134,77 @@ from pathlib import Path
 spider = SpiderForce4AI("http://localhost:3004")
 config = CrawlConfig(
+    target_selector="article.post-content",
     output_dir=Path("blog_content")
 )
-result = spider.crawl_url("https://example.com/blog", config)
-print(f"Content saved to: {result.url}.md")
+result = spider.crawl_url("https://example.com/blog-post", config)
 ```
-### 2. Advanced Parallel Sitemap Crawling
+### 2. Parallel Website Crawling
 ```python
 config = CrawlConfig(
-    max_concurrent_requests=5,
-    output_dir=Path("website_content"),
     remove_selectors=[
         ".navigation",
         ".footer",
         ".ads",
         "#cookie-notice"
     ],
+    max_concurrent_requests=5,
+    output_dir=Path("website_content"),
     webhook_url="https://your-webhook.com/endpoint"
 )
-results = spider.crawl_sitemap_parallel(
-    "https://example.com/sitemap.xml",
-    config
-)
+# Using server-side parallel processing
+results = spider.crawl_urls_server_parallel([
+    "https://example.com/page1",
+    "https://example.com/page2",
+    "https://example.com/page3"
+], config)
 ```
-### 3. Async Crawling with Progress
+### 3. Full Sitemap Processing
 ```python
-import asyncio
-async def main():
-    config = CrawlConfig(
-        max_concurrent_requests=3,
-        request_delay=1.0
-    )
-    async with spider:
-        results = await spider.crawl_urls_async([
-            "https://example.com/1",
-            "https://example.com/2",
-            "https://example.com/3"
-        ], config)
-    return results
+config = CrawlConfig(
+    target_selector="main",
+    remove_selectors=[".sidebar", ".comments"],
+    output_dir=Path("site_content"),
+    report_file=Path("crawl_report.json")
+)
-results = asyncio.run(main())
+results = spider.crawl_sitemap_server_parallel(
+    "https://example.com/sitemap.xml",
+    config
+)
 ```
 ## Output Structure
-### 1. File Organization
+### 1. Directory Layout
 ```
-output_dir/
-├── example-com-page1.md
+spiderforce_reports/           # Default output directory
+├── example-com-page1.md      # Converted markdown files
 ├── example-com-page2.md
-└── crawl_report.json
+└── crawl_report.json         # Crawl report
 ```
 ### 2. Markdown Files
-Each markdown file is named using a slugified version of the URL and contains the converted content.
+Each file is named using a slugified version of the URL:
+```markdown
+# Page Title
+Content converted to clean markdown...
+```
-### 3. Report JSON Structure
+### 3. Crawl Report
 ```json
 {
   "timestamp": "2025-02-15T10:30:00.123456",
   "config": {
     "target_selector": "article",
-    "remove_selectors": [".ads", "#popup"],
-    "remove_selectors_regex": ["modal-\\d+"]
+    "remove_selectors": [".ads", "#popup"]
   },
   "results": {
     "successful": [
@@ -234,7 +233,7 @@ Each markdown file is named using a slugified version of the URL and contains th
 ```
 ### 4. Webhook Notifications
-If configured, webhooks receive real-time updates in JSON format:
+If configured, real-time updates are sent for each processed URL:
 ```json
 {
   "url": "https://example.com/page1",
@@ -250,7 +249,7 @@ If configured, webhooks receive real-time updates in JSON format:
 ## Error Handling
-The package handles various types of errors:
+The package handles various types of errors gracefully:
 - Network errors
 - Timeout errors
 - Invalid URLs
@@ -269,6 +268,25 @@ All errors are:
 - Running SpiderForce4AI service
 - Internet connection
+## Performance Considerations
+1. Server-side Parallel Processing
+   - Best for most cases
+   - Single HTTP request for multiple URLs
+   - Less network overhead
+   - Use: `crawl_urls_server_parallel()` or `crawl_sitemap_server_parallel()`
+2. Client-side Parallel Processing
+   - Good for special cases requiring local control
+   - Uses Python multiprocessing
+   - More network overhead
+   - Use: `crawl_urls_parallel()` or `crawl_sitemap_parallel()`
+3. Async Processing
+   - Best for integration with async applications
+   - Good for real-time processing
+   - Use: `crawl_url_async()`, `crawl_urls_async()`, or `crawl_sitemap_async()`
 ## License
 MIT License

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/setup.cfg RENAMED Viewed

File without changes

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/spiderforce4ai.egg-info/SOURCES.txt RENAMED Viewed

File without changes

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/spiderforce4ai.egg-info/dependency_links.txt RENAMED Viewed

File without changes

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/spiderforce4ai.egg-info/requires.txt RENAMED Viewed

File without changes

{spiderforce4ai-0.1.7 → spiderforce4ai-0.1.8}/spiderforce4ai.egg-info/top_level.txt RENAMED Viewed

File without changes

spiderforce4ai 0.1.7__tar.gz → 0.1.8__tar.gz

spiderforce4ai 0.1.7tar.gz → 0.1.8tar.gz