spiderforce4ai-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
--- /dev/null
+++ b/PKG-INFO
@@ -0,0 +1,239 @@
1
+ Metadata-Version: 2.2
2
+ Name: spiderforce4ai
3
+ Version: 0.1.0
4
+ Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
5
+ Home-page: https://petertam.pro
6
+ Author: Piotr Tamulewicz
7
+ Author-email: Piotr Tamulewicz <pt@petertam.pro>
8
+ License: MIT
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Requires-Python: >=3.11
15
+ Description-Content-Type: text/markdown
16
+ Requires-Dist: aiohttp>=3.8.0
17
+ Requires-Dist: asyncio>=3.4.3
18
+ Requires-Dist: rich>=10.0.0
19
+ Requires-Dist: aiofiles>=0.8.0
20
+ Requires-Dist: httpx>=0.24.0
21
+ Dynamic: author
22
+ Dynamic: home-page
23
+ Dynamic: requires-python
24
+
25
+ # SpiderForce4AI Python Wrapper
26
+
27
+ A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
28
+
29
+ ## Features
30
+
31
+ - 🔄 Simple synchronous and asynchronous APIs
32
+ - 📁 Automatic Markdown file saving with URL-based filenames
33
+ - 📊 Real-time progress tracking in console
34
+ - 🪝 Webhook support for real-time notifications
35
+ - 📝 Detailed crawl reports in JSON format
36
+ - ⚡ Concurrent crawling with rate limiting
37
+ - 🔍 Support for sitemap.xml crawling
38
+ - 🛡️ Comprehensive error handling
39
+
40
+ ## Installation
41
+
42
+ ```bash
43
+ pip install spiderforce4ai
44
+ ```
45
+
46
+ ## Quick Start
47
+
48
+ ```python
49
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
50
+
51
+ # Initialize the client
52
+ spider = SpiderForce4AI("http://localhost:3004")
53
+
54
+ # Use default configuration
55
+ config = CrawlConfig()
56
+
57
+ # Crawl a single URL
58
+ result = spider.crawl_url("https://example.com", config)
59
+
60
+ # Crawl multiple URLs
61
+ urls = [
62
+ "https://example.com/page1",
63
+ "https://example.com/page2"
64
+ ]
65
+ results = spider.crawl_urls(urls, config)
66
+
67
+ # Crawl from sitemap
68
+ results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
69
+ ```
70
+
71
+ ## Configuration
72
+
73
+ The `CrawlConfig` class provides various configuration options. All parameters are optional with sensible defaults:
74
+
75
+ ```python
76
+ config = CrawlConfig(
77
+ # Content Selection (all optional)
78
+ target_selector="article", # Specific element to target
79
+ remove_selectors=[".ads", "#popup"], # Elements to remove
80
+ remove_selectors_regex=["modal-\\d+"], # Regex patterns for removal
81
+
82
+ # Processing Settings
83
+ max_concurrent_requests=1, # Default: 1
84
+ request_delay=0.5, # Delay between requests in seconds
85
+ timeout=30, # Request timeout in seconds
86
+
87
+ # Output Settings
88
+ output_dir="spiderforce_reports", # Default output directory
89
+ webhook_url="https://your-webhook.com", # Optional webhook endpoint
90
+ webhook_timeout=10, # Webhook timeout in seconds
91
+ report_file=None # Optional custom report location
92
+ )
93
+ ```
94
+
95
+ ### Default Directory Structure
96
+
97
+ ```
98
+ ./
99
+ └── spiderforce_reports/
100
+ ├── example_com_page1.md
101
+ ├── example_com_page2.md
102
+ └── crawl_report.json
103
+ ```
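+
+ Output filenames are derived from each URL by the package's internal slugify helper, which keeps the host and path and replaces the remaining characters with underscores. A simplified sketch of that mapping:
+
+ ```python
+ # Filename derivation (simplified sketch of the package's slugify helper)
+ import re
+ from urllib.parse import urlparse
+
+ def url_to_filename(url: str) -> str:
+     parsed = urlparse(url)
+     slug = re.sub(r"[^\w\-]", "_", f"{parsed.netloc}{parsed.path}")
+     return re.sub(r"_+", "_", slug).strip("_") + ".md"
+
+ print(url_to_filename("https://example.com/page1"))  # example_com_page1.md
+ ```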
104
+
105
+ ## Webhook Notifications
106
+
107
+ If `webhook_url` is configured, the crawler sends POST requests with the following JSON structure:
108
+
109
+ ```json
110
+ {
111
+ "url": "https://example.com/page1",
112
+ "status": "success",
113
+ "markdown": "# Page Title\n\nContent...",
114
+ "timestamp": "2025-02-15T10:30:00.123456",
115
+ "config": {
116
+ "target_selector": "article",
117
+ "remove_selectors": [".ads", "#popup"],
118
+ "remove_selectors_regex": ["modal-\\d+"]
119
+ }
120
+ }
121
+ ```
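+
+ For local testing, a minimal receiver sketch using only the Python standard library is shown below; the port is an arbitrary choice. Per the payload built in the source, a failed page is reported with `"status": "failed"`, an `"error"` message, and a null `"markdown"`.
+
+ ```python
+ # webhook_receiver.py - minimal local endpoint for inspecting crawl notifications (sketch)
+ import json
+ from http.server import BaseHTTPRequestHandler, HTTPServer
+
+ class WebhookHandler(BaseHTTPRequestHandler):
+     def do_POST(self):
+         length = int(self.headers.get("Content-Length", 0))
+         payload = json.loads(self.rfile.read(length) or b"{}")
+         print(f"{payload.get('status', 'unknown'):7} {payload.get('url')}")
+         self.send_response(200)
+         self.end_headers()
+
+ if __name__ == "__main__":
+     HTTPServer(("localhost", 8000), WebhookHandler).serve_forever()
+ ```
+
+ Point `webhook_url` at the receiver (for example `http://localhost:8000/`) when building `CrawlConfig`.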
122
+
123
+ ## Crawl Report
124
+
125
+ A comprehensive JSON report is automatically generated in the output directory:
126
+
127
+ ```json
128
+ {
129
+ "timestamp": "2025-02-15T10:30:00.123456",
130
+ "config": {
131
+ "target_selector": "article",
132
+ "remove_selectors": [".ads", "#popup"],
133
+ "remove_selectors_regex": ["modal-\\d+"]
134
+ },
135
+ "results": {
136
+ "successful": [
137
+ {
138
+ "url": "https://example.com/page1",
139
+ "status": "success",
140
+ "markdown": "# Page Title\n\nContent...",
141
+ "timestamp": "2025-02-15T10:30:00.123456"
142
+ }
143
+ ],
144
+ "failed": [
145
+ {
146
+ "url": "https://example.com/page2",
147
+ "status": "failed",
148
+ "error": "HTTP 404: Not Found",
149
+ "timestamp": "2025-02-15T10:30:01.123456"
150
+ }
151
+ ]
152
+ },
153
+ "summary": {
154
+ "total": 2,
155
+ "successful": 1,
156
+ "failed": 1
157
+ }
158
+ }
159
+ ```
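+
+ Because the report is plain JSON, it is easy to post-process. A minimal sketch, assuming the default report location:
+
+ ```python
+ # summarize_report.py - print totals and failed URLs from a finished crawl (sketch)
+ import json
+ from pathlib import Path
+
+ report = json.loads(Path("spiderforce_reports/crawl_report.json").read_text(encoding="utf-8"))
+
+ summary = report["summary"]
+ print(f"total={summary['total']} successful={summary['successful']} failed={summary['failed']}")
+ for item in report["results"]["failed"]:
+     print(f"- {item['url']}: {item['error']}")
+ ```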
160
+
161
+ ## Async Usage
162
+
163
+ ```python
164
+ import asyncio
165
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
166
+
167
+ async def main():
168
+ config = CrawlConfig()
169
+ spider = SpiderForce4AI("http://localhost:3004")
170
+
171
+ async with spider:
172
+ results = await spider.crawl_urls_async(
173
+ ["https://example.com/page1", "https://example.com/page2"],
174
+ config
175
+ )
176
+
177
+ return results
178
+
179
+ if __name__ == "__main__":
180
+ results = asyncio.run(main())
181
+ ```
182
+
183
+ ## Error Handling
184
+
185
+ The crawler is designed to be resilient:
186
+ - Continues processing even if some URLs fail
187
+ - Records all errors in the crawl report
188
+ - Sends error notifications via webhook if configured
189
+ - Provides clear error messages in console output
190
+
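+ Because each crawl result records its URL and status, failures can be retried in a follow-up pass. A minimal sketch using the same API shown above:
+
+ ```python
+ # retry_failed.py - re-crawl only the URLs that failed in a previous run (sketch)
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
+
+ spider = SpiderForce4AI("http://localhost:3004")
+ config = CrawlConfig()
+
+ urls = ["https://example.com/page1", "https://example.com/page2"]
+ results = spider.crawl_urls(urls, config)
+
+ failed_urls = [r.url for r in results if r.status == "failed"]
+ if failed_urls:
+     results += spider.crawl_urls(failed_urls, config)
+ ```
+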
191
+ ## Progress Tracking
192
+
193
+ The crawler provides real-time progress tracking in the console:
194
+
195
+ ```
196
+ 🔄 Crawling URLs... [####################] 100%
197
+ ✓ Successful: 95
198
+ ✗ Failed: 5
199
+ 📊 Report saved to: ./spiderforce_reports/crawl_report.json
200
+ ```
201
+
202
+ ## Usage with AI Agents
203
+
204
+ The package is designed to be easily integrated with AI agents and chat systems:
205
+
206
+ ```python
207
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
208
+
209
+ def fetch_content_for_ai(urls):
210
+ spider = SpiderForce4AI("http://localhost:3004")
211
+ config = CrawlConfig()
212
+
213
+ # Crawl content
214
+ results = spider.crawl_urls(urls, config)
215
+
216
+ # Return successful results
217
+ return {
218
+ result.url: result.markdown
219
+ for result in results
220
+ if result.status == "success"
221
+ }
222
+
223
+ # Use with AI agent
224
+ urls = ["https://example.com/article1", "https://example.com/article2"]
225
+ content = fetch_content_for_ai(urls)
226
+ ```
227
+
228
+ ## Requirements
229
+
230
+ - Python 3.11 or later
231
+ - Docker (for running the SpiderForce4AI service)
232
+
233
+ ## License
234
+
235
+ MIT License
236
+
237
+ ## Credits
238
+
239
+ Created by [Peter Tam](https://petertam.pro)
--- /dev/null
+++ b/README.md
@@ -0,0 +1,215 @@
1
+ # SpiderForce4AI Python Wrapper
2
+
3
+ A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
4
+
5
+ ## Features
6
+
7
+ - 🔄 Simple synchronous and asynchronous APIs
8
+ - 📁 Automatic Markdown file saving with URL-based filenames
9
+ - 📊 Real-time progress tracking in console
10
+ - 🪝 Webhook support for real-time notifications
11
+ - 📝 Detailed crawl reports in JSON format
12
+ - ⚡ Concurrent crawling with rate limiting
13
+ - 🔍 Support for sitemap.xml crawling
14
+ - 🛡️ Comprehensive error handling
15
+
16
+ ## Installation
17
+
18
+ ```bash
19
+ pip install spiderforce4ai
20
+ ```
21
+
22
+ ## Quick Start
23
+
24
+ ```python
25
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
26
+
27
+ # Initialize the client
28
+ spider = SpiderForce4AI("http://localhost:3004")
29
+
30
+ # Use default configuration
31
+ config = CrawlConfig()
32
+
33
+ # Crawl a single URL
34
+ result = spider.crawl_url("https://example.com", config)
35
+
36
+ # Crawl multiple URLs
37
+ urls = [
38
+ "https://example.com/page1",
39
+ "https://example.com/page2"
40
+ ]
41
+ results = spider.crawl_urls(urls, config)
42
+
43
+ # Crawl from sitemap
44
+ results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
45
+ ```
46
+
47
+ ## Configuration
48
+
49
+ The `CrawlConfig` class provides various configuration options. All parameters are optional with sensible defaults:
50
+
51
+ ```python
52
+ config = CrawlConfig(
53
+ # Content Selection (all optional)
54
+ target_selector="article", # Specific element to target
55
+ remove_selectors=[".ads", "#popup"], # Elements to remove
56
+ remove_selectors_regex=["modal-\\d+"], # Regex patterns for removal
57
+
58
+ # Processing Settings
59
+ max_concurrent_requests=1, # Default: 1
60
+ request_delay=0.5, # Delay between requests in seconds
61
+ timeout=30, # Request timeout in seconds
62
+
63
+ # Output Settings
64
+ output_dir="spiderforce_reports", # Default output directory
65
+ webhook_url="https://your-webhook.com", # Optional webhook endpoint
66
+ webhook_timeout=10, # Webhook timeout in seconds
67
+ report_file=None # Optional custom report location
68
+ )
69
+ ```
70
+
71
+ ### Default Directory Structure
72
+
73
+ ```
74
+ ./
75
+ └── spiderforce_reports/
76
+ ├── example_com_page1.md
77
+ ├── example_com_page2.md
78
+ └── crawl_report.json
79
+ ```
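+
+ Output filenames are derived from each URL by the package's internal slugify helper, which keeps the host and path and replaces the remaining characters with underscores. A simplified sketch of that mapping:
+
+ ```python
+ # Filename derivation (simplified sketch of the package's slugify helper)
+ import re
+ from urllib.parse import urlparse
+
+ def url_to_filename(url: str) -> str:
+     parsed = urlparse(url)
+     slug = re.sub(r"[^\w\-]", "_", f"{parsed.netloc}{parsed.path}")
+     return re.sub(r"_+", "_", slug).strip("_") + ".md"
+
+ print(url_to_filename("https://example.com/page1"))  # example_com_page1.md
+ ```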
80
+
81
+ ## Webhook Notifications
82
+
83
+ If `webhook_url` is configured, the crawler sends POST requests with the following JSON structure:
84
+
85
+ ```json
86
+ {
87
+ "url": "https://example.com/page1",
88
+ "status": "success",
89
+ "markdown": "# Page Title\n\nContent...",
90
+ "timestamp": "2025-02-15T10:30:00.123456",
91
+ "config": {
92
+ "target_selector": "article",
93
+ "remove_selectors": [".ads", "#popup"],
94
+ "remove_selectors_regex": ["modal-\\d+"]
95
+ }
96
+ }
97
+ ```
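+
+ For local testing, a minimal receiver sketch using only the Python standard library is shown below; the port is an arbitrary choice. Per the payload built in the source, a failed page is reported with `"status": "failed"`, an `"error"` message, and a null `"markdown"`.
+
+ ```python
+ # webhook_receiver.py - minimal local endpoint for inspecting crawl notifications (sketch)
+ import json
+ from http.server import BaseHTTPRequestHandler, HTTPServer
+
+ class WebhookHandler(BaseHTTPRequestHandler):
+     def do_POST(self):
+         length = int(self.headers.get("Content-Length", 0))
+         payload = json.loads(self.rfile.read(length) or b"{}")
+         print(f"{payload.get('status', 'unknown'):7} {payload.get('url')}")
+         self.send_response(200)
+         self.end_headers()
+
+ if __name__ == "__main__":
+     HTTPServer(("localhost", 8000), WebhookHandler).serve_forever()
+ ```
+
+ Point `webhook_url` at the receiver (for example `http://localhost:8000/`) when building `CrawlConfig`.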
98
+
99
+ ## Crawl Report
100
+
101
+ A comprehensive JSON report is automatically generated in the output directory:
102
+
103
+ ```json
104
+ {
105
+ "timestamp": "2025-02-15T10:30:00.123456",
106
+ "config": {
107
+ "target_selector": "article",
108
+ "remove_selectors": [".ads", "#popup"],
109
+ "remove_selectors_regex": ["modal-\\d+"]
110
+ },
111
+ "results": {
112
+ "successful": [
113
+ {
114
+ "url": "https://example.com/page1",
115
+ "status": "success",
116
+ "markdown": "# Page Title\n\nContent...",
117
+ "timestamp": "2025-02-15T10:30:00.123456"
118
+ }
119
+ ],
120
+ "failed": [
121
+ {
122
+ "url": "https://example.com/page2",
123
+ "status": "failed",
124
+ "error": "HTTP 404: Not Found",
125
+ "timestamp": "2025-02-15T10:30:01.123456"
126
+ }
127
+ ]
128
+ },
129
+ "summary": {
130
+ "total": 2,
131
+ "successful": 1,
132
+ "failed": 1
133
+ }
134
+ }
135
+ ```
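+
+ Because the report is plain JSON, it is easy to post-process. A minimal sketch, assuming the default report location:
+
+ ```python
+ # summarize_report.py - print totals and failed URLs from a finished crawl (sketch)
+ import json
+ from pathlib import Path
+
+ report = json.loads(Path("spiderforce_reports/crawl_report.json").read_text(encoding="utf-8"))
+
+ summary = report["summary"]
+ print(f"total={summary['total']} successful={summary['successful']} failed={summary['failed']}")
+ for item in report["results"]["failed"]:
+     print(f"- {item['url']}: {item['error']}")
+ ```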
136
+
137
+ ## Async Usage
138
+
139
+ ```python
140
+ import asyncio
141
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
142
+
143
+ async def main():
144
+ config = CrawlConfig()
145
+ spider = SpiderForce4AI("http://localhost:3004")
146
+
147
+ async with spider:
148
+ results = await spider.crawl_urls_async(
149
+ ["https://example.com/page1", "https://example.com/page2"],
150
+ config
151
+ )
152
+
153
+ return results
154
+
155
+ if __name__ == "__main__":
156
+ results = asyncio.run(main())
157
+ ```
158
+
159
+ ## Error Handling
160
+
161
+ The crawler is designed to be resilient:
162
+ - Continues processing even if some URLs fail
163
+ - Records all errors in the crawl report
164
+ - Sends error notifications via webhook if configured
165
+ - Provides clear error messages in console output
166
+
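+ Because each crawl result records its URL and status, failures can be retried in a follow-up pass. A minimal sketch using the same API shown above:
+
+ ```python
+ # retry_failed.py - re-crawl only the URLs that failed in a previous run (sketch)
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
+
+ spider = SpiderForce4AI("http://localhost:3004")
+ config = CrawlConfig()
+
+ urls = ["https://example.com/page1", "https://example.com/page2"]
+ results = spider.crawl_urls(urls, config)
+
+ failed_urls = [r.url for r in results if r.status == "failed"]
+ if failed_urls:
+     results += spider.crawl_urls(failed_urls, config)
+ ```
+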
167
+ ## Progress Tracking
168
+
169
+ The crawler provides real-time progress tracking in the console:
170
+
171
+ ```
172
+ 🔄 Crawling URLs... [####################] 100%
173
+ ✓ Successful: 95
174
+ ✗ Failed: 5
175
+ 📊 Report saved to: ./spiderforce_reports/crawl_report.json
176
+ ```
177
+
178
+ ## Usage with AI Agents
179
+
180
+ The package is designed to be easily integrated with AI agents and chat systems:
181
+
182
+ ```python
183
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
184
+
185
+ def fetch_content_for_ai(urls):
186
+ spider = SpiderForce4AI("http://localhost:3004")
187
+ config = CrawlConfig()
188
+
189
+ # Crawl content
190
+ results = spider.crawl_urls(urls, config)
191
+
192
+ # Return successful results
193
+ return {
194
+ result.url: result.markdown
195
+ for result in results
196
+ if result.status == "success"
197
+ }
198
+
199
+ # Use with AI agent
200
+ urls = ["https://example.com/article1", "https://example.com/article2"]
201
+ content = fetch_content_for_ai(urls)
202
+ ```
203
+
204
+ ## Requirements
205
+
206
+ - Python 3.11 or later
207
+ - Docker (for running the SpiderForce4AI service)
208
+
209
+ ## License
210
+
211
+ MIT License
212
+
213
+ ## Credits
214
+
215
+ Created by [Peter Tam](https://petertam.pro)
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,26 @@
1
+ [build-system]
2
+ requires = ["setuptools>=45", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "spiderforce4ai"
7
+ version = "0.1.0"
8
+ description = "Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service"
9
+ readme = "README.md"
10
+ authors = [{name = "Piotr Tamulewicz", email = "pt@petertam.pro"}]
11
+ license = {text = "MIT"}
12
+ classifiers = [
13
+ "Development Status :: 4 - Beta",
14
+ "Intended Audience :: Developers",
15
+ "License :: OSI Approved :: MIT License",
16
+ "Programming Language :: Python :: 3.11",
17
+ "Programming Language :: Python :: 3.12",
18
+ ]
19
+ requires-python = ">=3.11"
20
+ dependencies = [
21
+ "aiohttp>=3.8.0",
22
+ "asyncio>=3.4.3",
23
+ "rich>=10.0.0",
24
+ "aiofiles>=0.8.0",
25
+ "httpx>=0.24.0"
26
+ ]
--- /dev/null
+++ b/setup.cfg
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,29 @@
1
+ # setup.py
2
+ from setuptools import setup, find_packages
3
+
4
+ setup(
5
+ name="spiderforce4ai",
6
+ version="0.1.0",
7
+ author="Piotr Tamulewicz",
8
+ author_email="pt@petertam.pro",
9
+ description="Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service",
10
+ long_description=open("README.md").read(),
11
+ long_description_content_type="text/markdown",
12
+ url="https://petertam.pro",
13
+ packages=find_packages(),
14
+ classifiers=[
15
+ "Development Status :: 4 - Beta",
16
+ "Intended Audience :: Developers",
17
+ "License :: OSI Approved :: MIT License",
18
+ "Programming Language :: Python :: 3.11",
19
+ "Programming Language :: Python :: 3.12",
20
+ ],
21
+ python_requires=">=3.11",
22
+ install_requires=[
23
+ "aiohttp>=3.8.0",
24
+ "asyncio>=3.4.3",
25
+ "rich>=10.0.0",
26
+ "aiofiles>=0.8.0",
27
+ "httpx>=0.24.0"
28
+ ],
29
+ )
--- /dev/null
+++ b/spiderforce4ai/__init__.py
@@ -0,0 +1,303 @@
1
+ """
2
+ SpiderForce4AI Python Wrapper
3
+ A Python package for interacting with SpiderForce4AI HTML-to-Markdown conversion service.
4
+ """
5
+
6
+ import asyncio
7
+ import aiohttp
8
+ import json
9
+ import logging
10
+ from typing import List, Dict, Union, Optional
11
+ from dataclasses import dataclass, asdict
12
+ from urllib.parse import urljoin, urlparse
13
+ from pathlib import Path
14
+ import time
15
+ import xml.etree.ElementTree as ET
16
+ from concurrent.futures import ThreadPoolExecutor
17
+ from datetime import datetime
18
+ import re
19
+ from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TaskProgressColumn
20
+ from rich.console import Console
21
+ import aiofiles
22
+ import httpx
23
+
24
+ console = Console()
25
+
26
+ def slugify(url: str) -> str:
27
+ """Convert URL to a valid filename."""
28
+ parsed = urlparse(url)
29
+ # Combine domain and path, remove scheme and special characters
30
+ slug = f"{parsed.netloc}{parsed.path}"
31
+ slug = re.sub(r'[^\w\-]', '_', slug)
32
+ slug = re.sub(r'_+', '_', slug) # Replace multiple underscores with single
33
+ return slug.strip('_')
34
+
35
+ @dataclass
36
+ class CrawlResult:
37
+ """Store results of a crawl operation."""
38
+ url: str
39
+ status: str # 'success' or 'failed'
40
+ markdown: Optional[str] = None
41
+ error: Optional[str] = None
42
+ timestamp: str = None
43
+ config: Dict = None
44
+
45
+ def __post_init__(self):
46
+ if not self.timestamp:
47
+ self.timestamp = datetime.now().isoformat()
48
+
49
+ @dataclass
50
+ class CrawlConfig:
51
+ """Configuration for crawling settings."""
52
+ target_selector: Optional[str] = None # Optional - specific element to target
53
+ remove_selectors: Optional[List[str]] = None # Optional - elements to remove
54
+ remove_selectors_regex: Optional[List[str]] = None # Optional - regex patterns for removal
55
+ max_concurrent_requests: int = 1 # Default to single thread
56
+ request_delay: float = 0.5 # Delay between requests
57
+ timeout: int = 30 # Request timeout
58
+ output_dir: Path = Path("spiderforce_reports") # Default to spiderforce_reports in current directory
59
+ webhook_url: Optional[str] = None # Optional webhook endpoint
60
+ webhook_timeout: int = 10 # Webhook timeout
61
+ report_file: Optional[Path] = None # Optional report file location
62
+
63
+ def __post_init__(self):
64
+ # Initialize empty lists for selectors if None
65
+ self.remove_selectors = self.remove_selectors or []
66
+ self.remove_selectors_regex = self.remove_selectors_regex or []
67
+
68
+ # Ensure output_dir is a Path and exists
69
+ self.output_dir = Path(self.output_dir)
70
+ self.output_dir.mkdir(parents=True, exist_ok=True)
71
+
72
+ # If report_file is not specified, create it in output_dir
73
+ if self.report_file is None:
74
+ self.report_file = self.output_dir / "crawl_report.json"
75
+ else:
76
+ self.report_file = Path(self.report_file)
77
+
78
+ def to_dict(self) -> Dict:
79
+ """Convert config to dictionary for API requests."""
80
+ payload = {}
81
+ # Only include selectors if they are set
82
+ if self.target_selector:
83
+ payload["target_selector"] = self.target_selector
84
+ if self.remove_selectors:
85
+ payload["remove_selectors"] = self.remove_selectors
86
+ if self.remove_selectors_regex:
87
+ payload["remove_selectors_regex"] = self.remove_selectors_regex
88
+ return payload
89
+
90
+ class SpiderForce4AI:
91
+ """Main class for interacting with SpiderForce4AI service."""
92
+
93
+ def __init__(self, base_url: str):
94
+ self.base_url = base_url.rstrip('/')
95
+ self.session = None
96
+ self._executor = ThreadPoolExecutor()
97
+ self.crawl_results: List[CrawlResult] = []
98
+
99
+ async def _ensure_session(self):
100
+ """Ensure aiohttp session exists."""
101
+ if self.session is None or self.session.closed:
102
+ self.session = aiohttp.ClientSession()
103
+
104
+ async def _close_session(self):
105
+ """Close aiohttp session."""
106
+ if self.session and not self.session.closed:
107
+ await self.session.close()
108
+
109
+ async def _save_markdown(self, url: str, markdown: str, output_dir: Path):
110
+ """Save markdown content to file."""
111
+ filename = f"{slugify(url)}.md"
112
+ filepath = output_dir / filename
113
+ async with aiofiles.open(filepath, 'w', encoding='utf-8') as f:
114
+ await f.write(markdown)
115
+ return filepath
116
+
117
+ async def _send_webhook(self, result: CrawlResult, config: CrawlConfig):
118
+ """Send webhook with crawl results."""
119
+ if not config.webhook_url:
120
+ return
121
+
122
+ payload = {
123
+ "url": result.url,
124
+ "status": result.status,
125
+ "markdown": result.markdown if result.status == "success" else None,
126
+ "error": result.error if result.status == "failed" else None,
127
+ "timestamp": result.timestamp,
128
+ "config": config.to_dict()
129
+ }
130
+
131
+ try:
132
+ async with httpx.AsyncClient() as client:
133
+ response = await client.post(
134
+ config.webhook_url,
135
+ json=payload,
136
+ timeout=config.webhook_timeout
137
+ )
138
+ response.raise_for_status()
139
+ except Exception as e:
140
+ console.print(f"[yellow]Warning: Failed to send webhook for {result.url}: {str(e)}[/yellow]")
141
+
142
+ async def _save_report(self, config: CrawlConfig):
143
+ """Save crawl report to JSON file."""
144
+ if not config.report_file:
145
+ return
146
+
147
+ report = {
148
+ "timestamp": datetime.now().isoformat(),
149
+ "config": config.to_dict(),
150
+ "results": {
151
+ "successful": [asdict(r) for r in self.crawl_results if r.status == "success"],
152
+ "failed": [asdict(r) for r in self.crawl_results if r.status == "failed"]
153
+ },
154
+ "summary": {
155
+ "total": len(self.crawl_results),
156
+ "successful": len([r for r in self.crawl_results if r.status == "success"]),
157
+ "failed": len([r for r in self.crawl_results if r.status == "failed"])
158
+ }
159
+ }
160
+
161
+ async with aiofiles.open(config.report_file, 'w', encoding='utf-8') as f:
162
+ await f.write(json.dumps(report, indent=2))
163
+
164
+ async def crawl_url_async(self, url: str, config: CrawlConfig) -> CrawlResult:
165
+ """Crawl a single URL asynchronously."""
166
+ await self._ensure_session()
167
+
168
+ try:
169
+ endpoint = f"{self.base_url}/convert"
170
+ payload = {
171
+ "url": url,
172
+ **config.to_dict()
173
+ }
174
+
175
+ async with self.session.post(endpoint, json=payload, timeout=config.timeout) as response:
176
+ if response.status != 200:
177
+ error_text = await response.text()
178
+ result = CrawlResult(
179
+ url=url,
180
+ status="failed",
181
+ error=f"HTTP {response.status}: {error_text}",
182
+ config=config.to_dict()
183
+ )
184
+ else:
185
+ markdown = await response.text()
186
+ result = CrawlResult(
187
+ url=url,
188
+ status="success",
189
+ markdown=markdown,
190
+ config=config.to_dict()
191
+ )
192
+
193
+ if config.output_dir:
194
+ await self._save_markdown(url, markdown, config.output_dir)
195
+
196
+ await self._send_webhook(result, config)
197
+
198
+ self.crawl_results.append(result)
199
+ return result
200
+
201
+ except Exception as e:
202
+ result = CrawlResult(
203
+ url=url,
204
+ status="failed",
205
+ error=str(e),
206
+ config=config.to_dict()
207
+ )
208
+ self.crawl_results.append(result)
209
+ return result
210
+
211
+ def crawl_url(self, url: str, config: CrawlConfig) -> CrawlResult:
212
+ """Synchronous version of crawl_url_async."""
213
+ return asyncio.run(self.crawl_url_async(url, config))
214
+
215
+ async def crawl_urls_async(self, urls: List[str], config: CrawlConfig) -> List[CrawlResult]:
216
+ """Crawl multiple URLs asynchronously with progress bar."""
217
+ await self._ensure_session()
218
+
219
+ with Progress(
220
+ SpinnerColumn(),
221
+ TextColumn("[progress.description]{task.description}"),
222
+ BarColumn(),
223
+ TaskProgressColumn(),
224
+ console=console
225
+ ) as progress:
226
+ task = progress.add_task("[cyan]Crawling URLs...", total=len(urls))
227
+
228
+ async def crawl_with_progress(url):
229
+ result = await self.crawl_url_async(url, config)
230
+ progress.update(task, advance=1, description=f"[cyan]Crawled: {url}")
231
+ return result
232
+
233
+ semaphore = asyncio.Semaphore(config.max_concurrent_requests)
234
+ async def crawl_with_semaphore(url):
235
+ async with semaphore:
236
+ result = await crawl_with_progress(url)
237
+ await asyncio.sleep(config.request_delay)
238
+ return result
239
+
240
+ results = await asyncio.gather(*[crawl_with_semaphore(url) for url in urls])
241
+
242
+ # Save final report
243
+ await self._save_report(config)
244
+
245
+ # Print summary
246
+ successful = len([r for r in results if r.status == "success"])
247
+ failed = len([r for r in results if r.status == "failed"])
248
+ console.print(f"\n[green]Crawling completed:[/green]")
249
+ console.print(f"✓ Successful: {successful}")
250
+ console.print(f"✗ Failed: {failed}")
251
+
252
+ if config.report_file:
253
+ console.print(f"📊 Report saved to: {config.report_file}")
254
+
255
+ return results
256
+
257
+ def crawl_urls(self, urls: List[str], config: CrawlConfig) -> List[CrawlResult]:
258
+ """Synchronous version of crawl_urls_async."""
259
+ return asyncio.run(self.crawl_urls_async(urls, config))
260
+
261
+ async def crawl_sitemap_async(self, sitemap_url: str, config: CrawlConfig) -> List[CrawlResult]:
262
+ """Crawl URLs from a sitemap asynchronously."""
263
+ await self._ensure_session()
264
+
265
+ try:
266
+ console.print(f"[cyan]Fetching sitemap from {sitemap_url}...[/cyan]")
267
+ async with self.session.get(sitemap_url, timeout=config.timeout) as response:
268
+ sitemap_text = await response.text()
269
+ except Exception as e:
270
+ console.print(f"[red]Error fetching sitemap: {str(e)}[/red]")
271
+ raise
272
+
273
+ try:
274
+ root = ET.fromstring(sitemap_text)
275
+ namespace = {'ns': root.tag.split('}')[0].strip('{')}
276
+ urls = [loc.text for loc in root.findall('.//ns:loc', namespace)]
277
+ console.print(f"[green]Found {len(urls)} URLs in sitemap[/green]")
278
+ except Exception as e:
279
+ console.print(f"[red]Error parsing sitemap: {str(e)}[/red]")
280
+ raise
281
+
282
+ return await self.crawl_urls_async(urls, config)
283
+
284
+ def crawl_sitemap(self, sitemap_url: str, config: CrawlConfig) -> List[CrawlResult]:
285
+ """Synchronous version of crawl_sitemap_async."""
286
+ return asyncio.run(self.crawl_sitemap_async(sitemap_url, config))
287
+
288
+ async def __aenter__(self):
289
+ """Async context manager entry."""
290
+ await self._ensure_session()
291
+ return self
292
+
293
+ async def __aexit__(self, exc_type, exc_val, exc_tb):
294
+ """Async context manager exit."""
295
+ await self._close_session()
296
+
297
+ def __enter__(self):
298
+ """Sync context manager entry."""
299
+ return self
300
+
301
+ def __exit__(self, exc_type, exc_val, exc_tb):
302
+ """Sync context manager exit."""
303
+ self._executor.shutdown(wait=True)
--- /dev/null
+++ b/spiderforce4ai.egg-info/PKG-INFO
@@ -0,0 +1,239 @@
1
+ Metadata-Version: 2.2
2
+ Name: spiderforce4ai
3
+ Version: 0.1.0
4
+ Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
5
+ Home-page: https://petertam.pro
6
+ Author: Piotr Tamulewicz
7
+ Author-email: Piotr Tamulewicz <pt@petertam.pro>
8
+ License: MIT
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Requires-Python: >=3.11
15
+ Description-Content-Type: text/markdown
16
+ Requires-Dist: aiohttp>=3.8.0
17
+ Requires-Dist: asyncio>=3.4.3
18
+ Requires-Dist: rich>=10.0.0
19
+ Requires-Dist: aiofiles>=0.8.0
20
+ Requires-Dist: httpx>=0.24.0
21
+ Dynamic: author
22
+ Dynamic: home-page
23
+ Dynamic: requires-python
24
+
25
+ # SpiderForce4AI Python Wrapper
26
+
27
+ A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
28
+
29
+ ## Features
30
+
31
+ - 🔄 Simple synchronous and asynchronous APIs
32
+ - 📁 Automatic Markdown file saving with URL-based filenames
33
+ - 📊 Real-time progress tracking in console
34
+ - 🪝 Webhook support for real-time notifications
35
+ - 📝 Detailed crawl reports in JSON format
36
+ - ⚡ Concurrent crawling with rate limiting
37
+ - 🔍 Support for sitemap.xml crawling
38
+ - 🛡️ Comprehensive error handling
39
+
40
+ ## Installation
41
+
42
+ ```bash
43
+ pip install spiderforce4ai
44
+ ```
45
+
46
+ ## Quick Start
47
+
48
+ ```python
49
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
50
+
51
+ # Initialize the client
52
+ spider = SpiderForce4AI("http://localhost:3004")
53
+
54
+ # Use default configuration
55
+ config = CrawlConfig()
56
+
57
+ # Crawl a single URL
58
+ result = spider.crawl_url("https://example.com", config)
59
+
60
+ # Crawl multiple URLs
61
+ urls = [
62
+ "https://example.com/page1",
63
+ "https://example.com/page2"
64
+ ]
65
+ results = spider.crawl_urls(urls, config)
66
+
67
+ # Crawl from sitemap
68
+ results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
69
+ ```
70
+
71
+ ## Configuration
72
+
73
+ The `CrawlConfig` class provides various configuration options. All parameters are optional with sensible defaults:
74
+
75
+ ```python
76
+ config = CrawlConfig(
77
+ # Content Selection (all optional)
78
+ target_selector="article", # Specific element to target
79
+ remove_selectors=[".ads", "#popup"], # Elements to remove
80
+ remove_selectors_regex=["modal-\\d+"], # Regex patterns for removal
81
+
82
+ # Processing Settings
83
+ max_concurrent_requests=1, # Default: 1
84
+ request_delay=0.5, # Delay between requests in seconds
85
+ timeout=30, # Request timeout in seconds
86
+
87
+ # Output Settings
88
+ output_dir="spiderforce_reports", # Default output directory
89
+ webhook_url="https://your-webhook.com", # Optional webhook endpoint
90
+ webhook_timeout=10, # Webhook timeout in seconds
91
+ report_file=None # Optional custom report location
92
+ )
93
+ ```
94
+
95
+ ### Default Directory Structure
96
+
97
+ ```
98
+ ./
99
+ └── spiderforce_reports/
100
+ ├── example_com_page1.md
101
+ ├── example_com_page2.md
102
+ └── crawl_report.json
103
+ ```
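+
+ Output filenames are derived from each URL by the package's internal slugify helper, which keeps the host and path and replaces the remaining characters with underscores. A simplified sketch of that mapping:
+
+ ```python
+ # Filename derivation (simplified sketch of the package's slugify helper)
+ import re
+ from urllib.parse import urlparse
+
+ def url_to_filename(url: str) -> str:
+     parsed = urlparse(url)
+     slug = re.sub(r"[^\w\-]", "_", f"{parsed.netloc}{parsed.path}")
+     return re.sub(r"_+", "_", slug).strip("_") + ".md"
+
+ print(url_to_filename("https://example.com/page1"))  # example_com_page1.md
+ ```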
104
+
105
+ ## Webhook Notifications
106
+
107
+ If `webhook_url` is configured, the crawler sends POST requests with the following JSON structure:
108
+
109
+ ```json
110
+ {
111
+ "url": "https://example.com/page1",
112
+ "status": "success",
113
+ "markdown": "# Page Title\n\nContent...",
114
+ "timestamp": "2025-02-15T10:30:00.123456",
115
+ "config": {
116
+ "target_selector": "article",
117
+ "remove_selectors": [".ads", "#popup"],
118
+ "remove_selectors_regex": ["modal-\\d+"]
119
+ }
120
+ }
121
+ ```
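+
+ For local testing, a minimal receiver sketch using only the Python standard library is shown below; the port is an arbitrary choice. Per the payload built in the source, a failed page is reported with `"status": "failed"`, an `"error"` message, and a null `"markdown"`.
+
+ ```python
+ # webhook_receiver.py - minimal local endpoint for inspecting crawl notifications (sketch)
+ import json
+ from http.server import BaseHTTPRequestHandler, HTTPServer
+
+ class WebhookHandler(BaseHTTPRequestHandler):
+     def do_POST(self):
+         length = int(self.headers.get("Content-Length", 0))
+         payload = json.loads(self.rfile.read(length) or b"{}")
+         print(f"{payload.get('status', 'unknown'):7} {payload.get('url')}")
+         self.send_response(200)
+         self.end_headers()
+
+ if __name__ == "__main__":
+     HTTPServer(("localhost", 8000), WebhookHandler).serve_forever()
+ ```
+
+ Point `webhook_url` at the receiver (for example `http://localhost:8000/`) when building `CrawlConfig`.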
122
+
123
+ ## Crawl Report
124
+
125
+ A comprehensive JSON report is automatically generated in the output directory:
126
+
127
+ ```json
128
+ {
129
+ "timestamp": "2025-02-15T10:30:00.123456",
130
+ "config": {
131
+ "target_selector": "article",
132
+ "remove_selectors": [".ads", "#popup"],
133
+ "remove_selectors_regex": ["modal-\\d+"]
134
+ },
135
+ "results": {
136
+ "successful": [
137
+ {
138
+ "url": "https://example.com/page1",
139
+ "status": "success",
140
+ "markdown": "# Page Title\n\nContent...",
141
+ "timestamp": "2025-02-15T10:30:00.123456"
142
+ }
143
+ ],
144
+ "failed": [
145
+ {
146
+ "url": "https://example.com/page2",
147
+ "status": "failed",
148
+ "error": "HTTP 404: Not Found",
149
+ "timestamp": "2025-02-15T10:30:01.123456"
150
+ }
151
+ ]
152
+ },
153
+ "summary": {
154
+ "total": 2,
155
+ "successful": 1,
156
+ "failed": 1
157
+ }
158
+ }
159
+ ```
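+
+ Because the report is plain JSON, it is easy to post-process. A minimal sketch, assuming the default report location:
+
+ ```python
+ # summarize_report.py - print totals and failed URLs from a finished crawl (sketch)
+ import json
+ from pathlib import Path
+
+ report = json.loads(Path("spiderforce_reports/crawl_report.json").read_text(encoding="utf-8"))
+
+ summary = report["summary"]
+ print(f"total={summary['total']} successful={summary['successful']} failed={summary['failed']}")
+ for item in report["results"]["failed"]:
+     print(f"- {item['url']}: {item['error']}")
+ ```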
160
+
161
+ ## Async Usage
162
+
163
+ ```python
164
+ import asyncio
165
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
166
+
167
+ async def main():
168
+ config = CrawlConfig()
169
+ spider = SpiderForce4AI("http://localhost:3004")
170
+
171
+ async with spider:
172
+ results = await spider.crawl_urls_async(
173
+ ["https://example.com/page1", "https://example.com/page2"],
174
+ config
175
+ )
176
+
177
+ return results
178
+
179
+ if __name__ == "__main__":
180
+ results = asyncio.run(main())
181
+ ```
182
+
183
+ ## Error Handling
184
+
185
+ The crawler is designed to be resilient:
186
+ - Continues processing even if some URLs fail
187
+ - Records all errors in the crawl report
188
+ - Sends error notifications via webhook if configured
189
+ - Provides clear error messages in console output
190
+
191
+ ## Progress Tracking
192
+
193
+ The crawler provides real-time progress tracking in the console:
194
+
195
+ ```
196
+ 🔄 Crawling URLs... [####################] 100%
197
+ ✓ Successful: 95
198
+ ✗ Failed: 5
199
+ 📊 Report saved to: ./spiderforce_reports/crawl_report.json
200
+ ```
201
+
202
+ ## Usage with AI Agents
203
+
204
+ The package is designed to be easily integrated with AI agents and chat systems:
205
+
206
+ ```python
207
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
208
+
209
+ def fetch_content_for_ai(urls):
210
+ spider = SpiderForce4AI("http://localhost:3004")
211
+ config = CrawlConfig()
212
+
213
+ # Crawl content
214
+ results = spider.crawl_urls(urls, config)
215
+
216
+ # Return successful results
217
+ return {
218
+ result.url: result.markdown
219
+ for result in results
220
+ if result.status == "success"
221
+ }
222
+
223
+ # Use with AI agent
224
+ urls = ["https://example.com/article1", "https://example.com/article2"]
225
+ content = fetch_content_for_ai(urls)
226
+ ```
227
+
228
+ ## Requirements
229
+
230
+ - Python 3.11 or later
231
+ - Docker (for running the SpiderForce4AI service)
232
+
233
+ ## License
234
+
235
+ MIT License
236
+
237
+ ## Credits
238
+
239
+ Created by [Peter Tam](https://petertam.pro)
--- /dev/null
+++ b/spiderforce4ai.egg-info/SOURCES.txt
@@ -0,0 +1,9 @@
1
+ README.md
2
+ pyproject.toml
3
+ setup.py
4
+ spiderforce4ai/__init__.py
5
+ spiderforce4ai.egg-info/PKG-INFO
6
+ spiderforce4ai.egg-info/SOURCES.txt
7
+ spiderforce4ai.egg-info/dependency_links.txt
8
+ spiderforce4ai.egg-info/requires.txt
9
+ spiderforce4ai.egg-info/top_level.txt
--- /dev/null
+++ b/spiderforce4ai.egg-info/requires.txt
@@ -0,0 +1,5 @@
1
+ aiohttp>=3.8.0
2
+ asyncio>=3.4.3
3
+ rich>=10.0.0
4
+ aiofiles>=0.8.0
5
+ httpx>=0.24.0
--- /dev/null
+++ b/spiderforce4ai.egg-info/top_level.txt
@@ -0,0 +1 @@
1
+ spiderforce4ai