spiderforce4ai 0.1.0__tar.gz
- spiderforce4ai-0.1.0/PKG-INFO +239 -0
- spiderforce4ai-0.1.0/README.md +215 -0
- spiderforce4ai-0.1.0/pyproject.toml +26 -0
- spiderforce4ai-0.1.0/setup.cfg +4 -0
- spiderforce4ai-0.1.0/setup.py +29 -0
- spiderforce4ai-0.1.0/spiderforce4ai/__init__.py +303 -0
- spiderforce4ai-0.1.0/spiderforce4ai.egg-info/PKG-INFO +239 -0
- spiderforce4ai-0.1.0/spiderforce4ai.egg-info/SOURCES.txt +9 -0
- spiderforce4ai-0.1.0/spiderforce4ai.egg-info/dependency_links.txt +1 -0
- spiderforce4ai-0.1.0/spiderforce4ai.egg-info/requires.txt +5 -0
- spiderforce4ai-0.1.0/spiderforce4ai.egg-info/top_level.txt +1 -0

@@ -0,0 +1,239 @@ spiderforce4ai-0.1.0/PKG-INFO

Metadata-Version: 2.2
Name: spiderforce4ai
Version: 0.1.0
Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
Home-page: https://petertam.pro
Author: Piotr Tamulewicz
Author-email: Piotr Tamulewicz <pt@petertam.pro>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: asyncio>=3.4.3
Requires-Dist: rich>=10.0.0
Requires-Dist: aiofiles>=0.8.0
Requires-Dist: httpx>=0.24.0
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# SpiderForce4AI Python Wrapper

A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.

## Features

- 🔄 Simple synchronous and asynchronous APIs
- 📁 Automatic Markdown file saving with URL-based filenames
- 📊 Real-time progress tracking in console
- 🪝 Webhook support for real-time notifications
- 📝 Detailed crawl reports in JSON format
- ⚡ Concurrent crawling with rate limiting
- 🔍 Support for sitemap.xml crawling
- 🛡️ Comprehensive error handling

## Installation

```bash
pip install spiderforce4ai
```

## Quick Start

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

# Initialize the client
spider = SpiderForce4AI("http://localhost:3004")

# Use default configuration
config = CrawlConfig()

# Crawl a single URL
result = spider.crawl_url("https://example.com", config)

# Crawl multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2"
]
results = spider.crawl_urls(urls, config)

# Crawl from sitemap
results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
```

## Configuration

The `CrawlConfig` class provides various configuration options. All parameters are optional with sensible defaults:

```python
config = CrawlConfig(
    # Content Selection (all optional)
    target_selector="article",               # Specific element to target
    remove_selectors=[".ads", "#popup"],     # Elements to remove
    remove_selectors_regex=["modal-\\d+"],   # Regex patterns for removal

    # Processing Settings
    max_concurrent_requests=1,               # Default: 1
    request_delay=0.5,                       # Delay between requests in seconds
    timeout=30,                              # Request timeout in seconds

    # Output Settings
    output_dir="spiderforce_reports",        # Default output directory
    webhook_url="https://your-webhook.com",  # Optional webhook endpoint
    webhook_timeout=10,                      # Webhook timeout in seconds
    report_file=None                         # Optional custom report location
)
```

### Default Directory Structure

```
./
└── spiderforce_reports/
    ├── example-com-page1.md
    ├── example-com-page2.md
    └── crawl_report.json
```
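
The URL-based filenames come from the `slugify` helper in `spiderforce4ai/__init__.py` (shown later in this diff), which keeps the host and path of each URL and replaces everything else with underscores. A minimal sketch of that logic:

```python
import re
from urllib.parse import urlparse

def slugify(url: str) -> str:
    """Convert a URL into a filesystem-safe filename stem."""
    parsed = urlparse(url)
    # Keep host + path, replace non-word characters with underscores,
    # then collapse repeated underscores and trim the ends.
    slug = re.sub(r'[^\w\-]', '_', f"{parsed.netloc}{parsed.path}")
    return re.sub(r'_+', '_', slug).strip('_')

print(slugify("https://example.com/page1") + ".md")  # example_com_page1.md
```
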
## Webhook Notifications

If `webhook_url` is configured, the crawler sends POST requests with the following JSON structure:

```json
{
    "url": "https://example.com/page1",
    "status": "success",
    "markdown": "# Page Title\n\nContent...",
    "timestamp": "2025-02-15T10:30:00.123456",
    "config": {
        "target_selector": "article",
        "remove_selectors": [".ads", "#popup"],
        "remove_selectors_regex": ["modal-\\d+"]
    }
}
```
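
A minimal sketch of a receiver for these notifications, using only the Python standard library; the port and handler below are illustrative assumptions, not part of the package:

```python
# Hypothetical webhook receiver for the payload shown above (illustration only).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # 'url', 'status', 'markdown', 'error', 'timestamp' and 'config'
        # follow the structure documented above.
        print(payload.get("url"), payload.get("status"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Port 8000 is an assumption; point webhook_url at this address.
    HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()
```
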
## Crawl Report

A comprehensive JSON report is automatically generated in the output directory:

```json
{
    "timestamp": "2025-02-15T10:30:00.123456",
    "config": {
        "target_selector": "article",
        "remove_selectors": [".ads", "#popup"],
        "remove_selectors_regex": ["modal-\\d+"]
    },
    "results": {
        "successful": [
            {
                "url": "https://example.com/page1",
                "status": "success",
                "markdown": "# Page Title\n\nContent...",
                "timestamp": "2025-02-15T10:30:00.123456"
            }
        ],
        "failed": [
            {
                "url": "https://example.com/page2",
                "status": "failed",
                "error": "HTTP 404: Not Found",
                "timestamp": "2025-02-15T10:30:01.123456"
            }
        ]
    },
    "summary": {
        "total": 2,
        "successful": 1,
        "failed": 1
    }
}
```
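
Because the report is plain JSON, it can be post-processed directly. A short sketch that prints the summary and any failed URLs, assuming the default report location:

```python
import json
from pathlib import Path

# Load the report written to the default output directory.
report = json.loads(Path("spiderforce_reports/crawl_report.json").read_text(encoding="utf-8"))

summary = report["summary"]
print(f"{summary['successful']}/{summary['total']} pages converted")

for failure in report["results"]["failed"]:
    print(f"failed: {failure['url']} ({failure['error']})")
```
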
## Async Usage

```python
import asyncio
from spiderforce4ai import SpiderForce4AI, CrawlConfig

async def main():
    config = CrawlConfig()
    spider = SpiderForce4AI("http://localhost:3004")

    async with spider:
        results = await spider.crawl_urls_async(
            ["https://example.com/page1", "https://example.com/page2"],
            config
        )

    return results

if __name__ == "__main__":
    results = asyncio.run(main())
```
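
A single page can be fetched the same way with `crawl_url_async`, which is defined alongside `crawl_urls_async` in `spiderforce4ai/__init__.py`; a minimal sketch:

```python
import asyncio
from spiderforce4ai import SpiderForce4AI, CrawlConfig

async def fetch_one(url: str) -> str | None:
    # The client is an async context manager; the session is closed on exit.
    async with SpiderForce4AI("http://localhost:3004") as spider:
        result = await spider.crawl_url_async(url, CrawlConfig())
    # result.markdown is populated on success, result.error on failure.
    return result.markdown if result.status == "success" else None

markdown = asyncio.run(fetch_one("https://example.com"))
```
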
## Error Handling

The crawler is designed to be resilient:

- Continues processing even if some URLs fail
- Records all errors in the crawl report
- Sends error notifications via webhook if configured
- Provides clear error messages in console output
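
For example, failed URLs remain in the returned result list and carry an `error` message that can be inspected after a run:

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

spider = SpiderForce4AI("http://localhost:3004")
config = CrawlConfig()
results = spider.crawl_urls(["https://example.com/page1", "https://example.com/page2"], config)

# Failed URLs stay in the result list; their error field explains why.
for r in results:
    if r.status == "failed":
        print(f"{r.url}: {r.error}")  # e.g. "HTTP 404: Not Found"
```
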
## Progress Tracking

The crawler provides real-time progress tracking in the console:

```
🔄 Crawling URLs... [####################] 100%
✓ Successful: 95
✗ Failed: 5
📊 Report saved to: ./spiderforce_reports/crawl_report.json
```

## Usage with AI Agents

The package is designed to be easily integrated with AI agents and chat systems:

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

def fetch_content_for_ai(urls):
    spider = SpiderForce4AI("http://localhost:3004")
    config = CrawlConfig()

    # Crawl content
    results = spider.crawl_urls(urls, config)

    # Return successful results
    return {
        result.url: result.markdown
        for result in results
        if result.status == "success"
    }

# Use with AI agent
urls = ["https://example.com/article1", "https://example.com/article2"]
content = fetch_content_for_ai(urls)
```

## Requirements

- Python 3.11 or later
- Docker (for running the SpiderForce4AI service)

## License

MIT License

## Credits

Created by [Peter Tam](https://petertam.pro)

@@ -0,0 +1,215 @@ spiderforce4ai-0.1.0/README.md

(Identical to the README text embedded as the long description in PKG-INFO above.)

@@ -0,0 +1,26 @@ spiderforce4ai-0.1.0/pyproject.toml

[build-system]
requires = ["setuptools>=45", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "spiderforce4ai"
version = "0.1.0"
description = "Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service"
readme = "README.md"
authors = [{name = "Piotr Tamulewicz", email = "pt@petertam.pro"}]
license = {text = "MIT"}
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
]
requires-python = ">=3.11"
dependencies = [
    "aiohttp>=3.8.0",
    "asyncio>=3.4.3",
    "rich>=10.0.0",
    "aiofiles>=0.8.0",
    "httpx>=0.24.0"
]

@@ -0,0 +1,29 @@ spiderforce4ai-0.1.0/setup.py

# setup.py
from setuptools import setup, find_packages

setup(
    name="spiderforce4ai",
    version="0.1.0",
    author="Piotr Tamulewicz",
    author_email="pt@petertam.pro",
    description="Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service",
    long_description=open("README.md").read(),
    long_description_content_type="text/markdown",
    url="https://petertam.pro",
    packages=find_packages(),
    classifiers=[
        "Development Status :: 4 - Beta",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3.11",
        "Programming Language :: Python :: 3.12",
    ],
    python_requires=">=3.11",
    install_requires=[
        "aiohttp>=3.8.0",
        "asyncio>=3.4.3",
        "rich>=10.0.0",
        "aiofiles>=0.8.0",
        "httpx>=0.24.0"
    ],
)

@@ -0,0 +1,303 @@ spiderforce4ai-0.1.0/spiderforce4ai/__init__.py

"""
SpiderForce4AI Python Wrapper
A Python package for interacting with SpiderForce4AI HTML-to-Markdown conversion service.
"""

import asyncio
import aiohttp
import json
import logging
from typing import List, Dict, Union, Optional
from dataclasses import dataclass, asdict
from urllib.parse import urljoin, urlparse
from pathlib import Path
import time
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
import re
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TaskProgressColumn
from rich.console import Console
import aiofiles
import httpx

console = Console()

def slugify(url: str) -> str:
    """Convert URL to a valid filename."""
    parsed = urlparse(url)
    # Combine domain and path, remove scheme and special characters
    slug = f"{parsed.netloc}{parsed.path}"
    slug = re.sub(r'[^\w\-]', '_', slug)
    slug = re.sub(r'_+', '_', slug)  # Replace multiple underscores with single
    return slug.strip('_')

@dataclass
class CrawlResult:
    """Store results of a crawl operation."""
    url: str
    status: str  # 'success' or 'failed'
    markdown: Optional[str] = None
    error: Optional[str] = None
    timestamp: str = None
    config: Dict = None

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()

@dataclass
class CrawlConfig:
    """Configuration for crawling settings."""
    target_selector: Optional[str] = None  # Optional - specific element to target
    remove_selectors: Optional[List[str]] = None  # Optional - elements to remove
    remove_selectors_regex: Optional[List[str]] = None  # Optional - regex patterns for removal
    max_concurrent_requests: int = 1  # Default to single thread
    request_delay: float = 0.5  # Delay between requests
    timeout: int = 30  # Request timeout
    output_dir: Path = Path("spiderforce_reports")  # Default to spiderforce_reports in current directory
    webhook_url: Optional[str] = None  # Optional webhook endpoint
    webhook_timeout: int = 10  # Webhook timeout
    report_file: Optional[Path] = None  # Optional report file location

    def __post_init__(self):
        # Initialize empty lists for selectors if None
        self.remove_selectors = self.remove_selectors or []
        self.remove_selectors_regex = self.remove_selectors_regex or []

        # Ensure output_dir is a Path and exists
        self.output_dir = Path(self.output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

        # If report_file is not specified, create it in output_dir
        if self.report_file is None:
            self.report_file = self.output_dir / "crawl_report.json"
        else:
            self.report_file = Path(self.report_file)

    def to_dict(self) -> Dict:
        """Convert config to dictionary for API requests."""
        payload = {}
        # Only include selectors if they are set
        if self.target_selector:
            payload["target_selector"] = self.target_selector
        if self.remove_selectors:
            payload["remove_selectors"] = self.remove_selectors
        if self.remove_selectors_regex:
            payload["remove_selectors_regex"] = self.remove_selectors_regex
        return payload

class SpiderForce4AI:
    """Main class for interacting with SpiderForce4AI service."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip('/')
        self.session = None
        self._executor = ThreadPoolExecutor()
        self.crawl_results: List[CrawlResult] = []

    async def _ensure_session(self):
        """Ensure aiohttp session exists."""
        if self.session is None or self.session.closed:
            self.session = aiohttp.ClientSession()

    async def _close_session(self):
        """Close aiohttp session."""
        if self.session and not self.session.closed:
            await self.session.close()

    async def _save_markdown(self, url: str, markdown: str, output_dir: Path):
        """Save markdown content to file."""
        filename = f"{slugify(url)}.md"
        filepath = output_dir / filename
        async with aiofiles.open(filepath, 'w', encoding='utf-8') as f:
            await f.write(markdown)
        return filepath

    async def _send_webhook(self, result: CrawlResult, config: CrawlConfig):
        """Send webhook with crawl results."""
        if not config.webhook_url:
            return

        payload = {
            "url": result.url,
            "status": result.status,
            "markdown": result.markdown if result.status == "success" else None,
            "error": result.error if result.status == "failed" else None,
            "timestamp": result.timestamp,
            "config": config.to_dict()
        }

        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    config.webhook_url,
                    json=payload,
                    timeout=config.webhook_timeout
                )
                response.raise_for_status()
        except Exception as e:
            console.print(f"[yellow]Warning: Failed to send webhook for {result.url}: {str(e)}[/yellow]")

    async def _save_report(self, config: CrawlConfig):
        """Save crawl report to JSON file."""
        if not config.report_file:
            return

        report = {
            "timestamp": datetime.now().isoformat(),
            "config": config.to_dict(),
            "results": {
                "successful": [asdict(r) for r in self.crawl_results if r.status == "success"],
                "failed": [asdict(r) for r in self.crawl_results if r.status == "failed"]
            },
            "summary": {
                "total": len(self.crawl_results),
                "successful": len([r for r in self.crawl_results if r.status == "success"]),
                "failed": len([r for r in self.crawl_results if r.status == "failed"])
            }
        }

        async with aiofiles.open(config.report_file, 'w', encoding='utf-8') as f:
            await f.write(json.dumps(report, indent=2))

    async def crawl_url_async(self, url: str, config: CrawlConfig) -> CrawlResult:
        """Crawl a single URL asynchronously."""
        await self._ensure_session()

        try:
            endpoint = f"{self.base_url}/convert"
            payload = {
                "url": url,
                **config.to_dict()
            }

            async with self.session.post(endpoint, json=payload, timeout=config.timeout) as response:
                if response.status != 200:
                    error_text = await response.text()
                    result = CrawlResult(
                        url=url,
                        status="failed",
                        error=f"HTTP {response.status}: {error_text}",
                        config=config.to_dict()
                    )
                else:
                    markdown = await response.text()
                    result = CrawlResult(
                        url=url,
                        status="success",
                        markdown=markdown,
                        config=config.to_dict()
                    )

                    if config.output_dir:
                        await self._save_markdown(url, markdown, config.output_dir)

                await self._send_webhook(result, config)

            self.crawl_results.append(result)
            return result

        except Exception as e:
            result = CrawlResult(
                url=url,
                status="failed",
                error=str(e),
                config=config.to_dict()
            )
            self.crawl_results.append(result)
            return result

    def crawl_url(self, url: str, config: CrawlConfig) -> CrawlResult:
        """Synchronous version of crawl_url_async."""
        return asyncio.run(self.crawl_url_async(url, config))

    async def crawl_urls_async(self, urls: List[str], config: CrawlConfig) -> List[CrawlResult]:
        """Crawl multiple URLs asynchronously with progress bar."""
        await self._ensure_session()

        with Progress(
            SpinnerColumn(),
            TextColumn("[progress.description]{task.description}"),
            BarColumn(),
            TaskProgressColumn(),
            console=console
        ) as progress:
            task = progress.add_task("[cyan]Crawling URLs...", total=len(urls))

            async def crawl_with_progress(url):
                result = await self.crawl_url_async(url, config)
                progress.update(task, advance=1, description=f"[cyan]Crawled: {url}")
                return result

            semaphore = asyncio.Semaphore(config.max_concurrent_requests)
            async def crawl_with_semaphore(url):
                async with semaphore:
                    result = await crawl_with_progress(url)
                    await asyncio.sleep(config.request_delay)
                    return result

            results = await asyncio.gather(*[crawl_with_semaphore(url) for url in urls])

        # Save final report
        await self._save_report(config)

        # Print summary
        successful = len([r for r in results if r.status == "success"])
        failed = len([r for r in results if r.status == "failed"])
        console.print(f"\n[green]Crawling completed:[/green]")
        console.print(f"✓ Successful: {successful}")
        console.print(f"✗ Failed: {failed}")

        if config.report_file:
            console.print(f"📊 Report saved to: {config.report_file}")

        return results

    def crawl_urls(self, urls: List[str], config: CrawlConfig) -> List[CrawlResult]:
        """Synchronous version of crawl_urls_async."""
        return asyncio.run(self.crawl_urls_async(urls, config))

    async def crawl_sitemap_async(self, sitemap_url: str, config: CrawlConfig) -> List[CrawlResult]:
        """Crawl URLs from a sitemap asynchronously."""
        await self._ensure_session()

        try:
            console.print(f"[cyan]Fetching sitemap from {sitemap_url}...[/cyan]")
            async with self.session.get(sitemap_url, timeout=config.timeout) as response:
                sitemap_text = await response.text()
        except Exception as e:
            console.print(f"[red]Error fetching sitemap: {str(e)}[/red]")
            raise

        try:
            root = ET.fromstring(sitemap_text)
            namespace = {'ns': root.tag.split('}')[0].strip('{')}
            urls = [loc.text for loc in root.findall('.//ns:loc', namespace)]
            console.print(f"[green]Found {len(urls)} URLs in sitemap[/green]")
        except Exception as e:
            console.print(f"[red]Error parsing sitemap: {str(e)}[/red]")
            raise

        return await self.crawl_urls_async(urls, config)

    def crawl_sitemap(self, sitemap_url: str, config: CrawlConfig) -> List[CrawlResult]:
        """Synchronous version of crawl_sitemap_async."""
        return asyncio.run(self.crawl_sitemap_async(sitemap_url, config))

    async def __aenter__(self):
        """Async context manager entry."""
        await self._ensure_session()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Async context manager exit."""
        await self._close_session()

    def __enter__(self):
        """Sync context manager entry."""
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Sync context manager exit."""
        self._executor.shutdown(wait=True)

@@ -0,0 +1,239 @@ spiderforce4ai-0.1.0/spiderforce4ai.egg-info/PKG-INFO

(Identical to spiderforce4ai-0.1.0/PKG-INFO above.)

@@ -0,0 +1,9 @@ spiderforce4ai-0.1.0/spiderforce4ai.egg-info/SOURCES.txt

README.md
pyproject.toml
setup.py
spiderforce4ai/__init__.py
spiderforce4ai.egg-info/PKG-INFO
spiderforce4ai.egg-info/SOURCES.txt
spiderforce4ai.egg-info/dependency_links.txt
spiderforce4ai.egg-info/requires.txt
spiderforce4ai.egg-info/top_level.txt

@@ -0,0 +1 @@ spiderforce4ai-0.1.0/spiderforce4ai.egg-info/dependency_links.txt


@@ -0,0 +1 @@ spiderforce4ai-0.1.0/spiderforce4ai.egg-info/top_level.txt

spiderforce4ai