spiderforce4ai 0.1.5__tar.gz → 0.1.6__tar.gz

@@ -0,0 +1,278 @@
1
+ Metadata-Version: 2.2
2
+ Name: spiderforce4ai
3
+ Version: 0.1.6
4
+ Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
5
+ Home-page: https://petertam.pro
6
+ Author: Piotr Tamulewicz
7
+ Author-email: Piotr Tamulewicz <pt@petertam.pro>
8
+ License: MIT
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Requires-Python: >=3.11
15
+ Description-Content-Type: text/markdown
16
+ Requires-Dist: aiohttp>=3.8.0
17
+ Requires-Dist: asyncio>=3.4.3
18
+ Requires-Dist: rich>=10.0.0
19
+ Requires-Dist: aiofiles>=0.8.0
20
+ Requires-Dist: httpx>=0.24.0
21
+ Dynamic: author
22
+ Dynamic: home-page
23
+ Dynamic: requires-python
24
+
25
+ # SpiderForce4AI Python Wrapper
26
+
27
+ A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
28
+
29
+ ## Installation
30
+
31
+ ```bash
32
+ pip install spiderforce4ai
33
+ ```
34
+
35
+ ## Quick Start (Minimal Setup)
36
+
37
+ ```python
38
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
39
+
40
+ # Initialize with your SpiderForce4AI service URL
41
+ spider = SpiderForce4AI("http://localhost:3004")
42
+
43
+ # Use default configuration (will save in ./spiderforce_reports)
44
+ config = CrawlConfig()
45
+
46
+ # Crawl a single URL
47
+ result = spider.crawl_url("https://example.com", config)
48
+ ```
49
+
50
+ ## Crawling Methods
51
+
52
+ ### 1. Single URL Crawling
53
+
54
+ ```python
55
+ # Synchronous
56
+ result = spider.crawl_url("https://example.com", config)
57
+
58
+ # Asynchronous
59
+ async def crawl():
60
+     result = await spider.crawl_url_async("https://example.com", config)
61
+ ```
62
+
63
+ ### 2. Multiple URLs Crawling
64
+
65
+ ```python
66
+ # List of URLs
67
+ urls = [
68
+ "https://example.com/page1",
69
+ "https://example.com/page2",
70
+ "https://example.com/page3"
71
+ ]
72
+
73
+ # Synchronous
74
+ results = spider.crawl_urls(urls, config)
75
+
76
+ # Asynchronous
77
+ async def crawl():
78
+     results = await spider.crawl_urls_async(urls, config)
79
+
80
+ # Parallel (using multiprocessing)
81
+ results = spider.crawl_urls_parallel(urls, config)
82
+ ```
83
+
84
+ ### 3. Sitemap Crawling
85
+
86
+ ```python
87
+ # Synchronous
88
+ results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
89
+
90
+ # Asynchronous
91
+ async def crawl():
92
+     results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
93
+
94
+ # Parallel (using multiprocessing)
95
+ results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
96
+ ```
97
+
98
+ ## Configuration Options
99
+
100
+ All configuration options are optional with sensible defaults:
101
+
102
+ ```python
103
+ config = CrawlConfig(
104
+     # Content Selection (all optional)
105
+     target_selector="article", # Specific element to target
106
+     remove_selectors=[ # Elements to remove
107
+         ".ads",
108
+         "#popup",
109
+         ".navigation",
110
+         ".footer"
111
+     ],
112
+     remove_selectors_regex=["modal-\\d+"], # Regex patterns for removal
113
+
114
+     # Processing Settings
115
+     max_concurrent_requests=1, # Default: 1; raise to crawl URLs in parallel
116
+     request_delay=0.5, # Delay between requests in seconds
117
+     timeout=30, # Request timeout in seconds
118
+
119
+     # Output Settings
120
+     output_dir="custom_output", # Default: "spiderforce_reports"
121
+     report_file="custom_report.json", # Default: "crawl_report.json"
122
+     webhook_url="https://your-webhook.com", # Optional webhook endpoint
123
+     webhook_timeout=10 # Webhook timeout in seconds
124
+ )
125
+ ```
126
+
127
+ ## Real-World Examples
128
+
129
+ ### 1. Basic Website Crawling
130
+
131
+ ```python
132
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
133
+ from pathlib import Path
134
+
135
+ spider = SpiderForce4AI("http://localhost:3004")
136
+ config = CrawlConfig(
137
+ output_dir=Path("blog_content")
138
+ )
139
+
140
+ result = spider.crawl_url("https://example.com/blog", config)
141
+ print(f"Content saved to: {result.url}.md")
142
+ ```
143
+
144
+ ### 2. Advanced Parallel Sitemap Crawling
145
+
146
+ ```python
147
+ config = CrawlConfig(
148
+     max_concurrent_requests=5,
149
+     output_dir=Path("website_content"),
150
+     remove_selectors=[
151
+         ".navigation",
152
+         ".footer",
153
+         ".ads",
154
+         "#cookie-notice"
155
+     ],
156
+     webhook_url="https://your-webhook.com/endpoint"
157
+ )
158
+
159
+ results = spider.crawl_sitemap_parallel(
160
+ "https://example.com/sitemap.xml",
161
+ config
162
+ )
163
+ ```
164
+
165
+ ### 3. Async Crawling with Progress
166
+
167
+ ```python
168
+ import asyncio
169
+
170
+ async def main():
171
+     config = CrawlConfig(
172
+         max_concurrent_requests=3,
173
+         request_delay=1.0
174
+     )
175
+
176
+     async with spider:
177
+         results = await spider.crawl_urls_async([
178
+             "https://example.com/1",
179
+             "https://example.com/2",
180
+             "https://example.com/3"
181
+         ], config)
182
+
183
+     return results
184
+
185
+ results = asyncio.run(main())
186
+ ```
187
+
188
+ ## Output Structure
189
+
190
+ ### 1. File Organization
191
+ ```
192
+ output_dir/
193
+ ├── example-com-page1.md
194
+ ├── example-com-page2.md
195
+ └── crawl_report.json
196
+ ```
197
+
198
+ ### 2. Markdown Files
199
+ Each markdown file is named using a slugified version of the URL and contains the converted content.
200
+
201
+ ### 3. Report JSON Structure
202
+ ```json
203
+ {
204
+ "timestamp": "2025-02-15T10:30:00.123456",
205
+ "config": {
206
+ "target_selector": "article",
207
+ "remove_selectors": [".ads", "#popup"],
208
+ "remove_selectors_regex": ["modal-\\d+"]
209
+ },
210
+ "results": {
211
+ "successful": [
212
+ {
213
+ "url": "https://example.com/page1",
214
+ "status": "success",
215
+ "markdown": "# Page Title\n\nContent...",
216
+ "timestamp": "2025-02-15T10:30:00.123456"
217
+ }
218
+ ],
219
+ "failed": [
220
+ {
221
+ "url": "https://example.com/page2",
222
+ "status": "failed",
223
+ "error": "HTTP 404: Not Found",
224
+ "timestamp": "2025-02-15T10:30:01.123456"
225
+ }
226
+ ]
227
+ },
228
+ "summary": {
229
+ "total": 2,
230
+ "successful": 1,
231
+ "failed": 1
232
+ }
233
+ }
234
+ ```
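+
+ Because the report is plain JSON, it can be inspected with the standard library alone. The snippet below is a minimal sketch that assumes the default output directory and report file name shown above; adjust the path if you changed output_dir or report_file.
+
+ ```python
+ # Sketch: summarize a finished crawl from the JSON report (stdlib only).
+ import json
+ from pathlib import Path
+
+ report_path = Path("spiderforce_reports") / "crawl_report.json"
+ report = json.loads(report_path.read_text())
+
+ summary = report["summary"]
+ print(f"Crawled {summary['total']} URLs: "
+       f"{summary['successful']} succeeded, {summary['failed']} failed")
+
+ # List the URLs that need a retry, with the recorded error message.
+ for entry in report["results"]["failed"]:
+     print(f"  {entry['url']}: {entry['error']}")
+ ```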
235
+
236
+ ### 4. Webhook Notifications
237
+ If a webhook_url is configured, the endpoint receives a JSON notification for each crawled URL:
238
+ ```json
239
+ {
240
+ "url": "https://example.com/page1",
241
+ "status": "success",
242
+ "markdown": "# Page Title\n\nContent...",
243
+ "timestamp": "2025-02-15T10:30:00.123456",
244
+ "config": {
245
+ "target_selector": "article",
246
+ "remove_selectors": [".ads", "#popup"]
247
+ }
248
+ }
249
+ ```
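+
+ Any HTTP endpoint that accepts a JSON POST body can act as the webhook target. The sketch below is a minimal stand-alone receiver built on Python's standard library; the printed fields come from the payload shown above, while the host, port, and handler itself are purely illustrative and not part of this package.
+
+ ```python
+ # Sketch: a minimal webhook receiver for crawl notifications (stdlib only).
+ import json
+ from http.server import BaseHTTPRequestHandler, HTTPServer
+
+ class CrawlNotificationHandler(BaseHTTPRequestHandler):
+     def do_POST(self):
+         length = int(self.headers.get("Content-Length", 0))
+         payload = json.loads(self.rfile.read(length))
+         # Each notification carries the crawled URL and its status.
+         print(f"{payload['status']}: {payload['url']}")
+         self.send_response(200)
+         self.end_headers()
+
+ if __name__ == "__main__":
+     # Point webhook_url at http://<this-host>:8000/ in CrawlConfig.
+     HTTPServer(("0.0.0.0", 8000), CrawlNotificationHandler).serve_forever()
+ ```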
250
+
251
+ ## Error Handling
252
+
253
+ The package handles various types of errors:
254
+ - Network errors
255
+ - Timeout errors
256
+ - Invalid URLs
257
+ - Missing content
258
+ - Service errors
259
+
260
+ All errors are:
261
+ 1. Logged to the console
262
+ 2. Included in the JSON report
263
+ 3. Sent via webhook (if configured)
264
+ 4. Available in the returned results list (see the sketch below)
265
+
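+ Since every crawl method returns one result per URL, failures can also be handled directly in code. This is a minimal sketch; the status and error attribute names are assumed to mirror the fields of the JSON report and may differ on the actual result objects.
+
+ ```python
+ # Sketch: separate failed URLs from successful ones after a crawl.
+ # Assumes result objects expose url, status, and error, mirroring the report.
+ results = spider.crawl_urls(urls, config)
+
+ failed = [r for r in results if r.status != "success"]
+ for r in failed:
+     print(f"Needs retry: {r.url} ({r.error})")
+ ```
+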
266
+ ## Requirements
267
+
268
+ - Python 3.11 or later
269
+ - Running SpiderForce4AI service
270
+ - Internet connection
271
+
272
+ ## License
273
+
274
+ MIT License
275
+
276
+ ## Credits
277
+
278
+ Created by [Peter Tam](https://petertam.pro)
@@ -0,0 +1,254 @@
1
+ # SpiderForce4AI Python Wrapper
2
+
3
+ A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
4
+
5
+ ## Installation
6
+
7
+ ```bash
8
+ pip install spiderforce4ai
9
+ ```
10
+
11
+ ## Quick Start (Minimal Setup)
12
+
13
+ ```python
14
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
15
+
16
+ # Initialize with your SpiderForce4AI service URL
17
+ spider = SpiderForce4AI("http://localhost:3004")
18
+
19
+ # Use default configuration (will save in ./spiderforce_reports)
20
+ config = CrawlConfig()
21
+
22
+ # Crawl a single URL
23
+ result = spider.crawl_url("https://example.com", config)
24
+ ```
25
+
26
+ ## Crawling Methods
27
+
28
+ ### 1. Single URL Crawling
29
+
30
+ ```python
31
+ # Synchronous
32
+ result = spider.crawl_url("https://example.com", config)
33
+
34
+ # Asynchronous
35
+ async def crawl():
36
+     result = await spider.crawl_url_async("https://example.com", config)
37
+ ```
38
+
39
+ ### 2. Multiple URLs Crawling
40
+
41
+ ```python
42
+ # List of URLs
43
+ urls = [
44
+ "https://example.com/page1",
45
+ "https://example.com/page2",
46
+ "https://example.com/page3"
47
+ ]
48
+
49
+ # Synchronous
50
+ results = spider.crawl_urls(urls, config)
51
+
52
+ # Asynchronous
53
+ async def crawl():
54
+     results = await spider.crawl_urls_async(urls, config)
55
+
56
+ # Parallel (using multiprocessing)
57
+ results = spider.crawl_urls_parallel(urls, config)
58
+ ```
59
+
60
+ ### 3. Sitemap Crawling
61
+
62
+ ```python
63
+ # Synchronous
64
+ results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
65
+
66
+ # Asynchronous
67
+ async def crawl():
68
+     results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
69
+
70
+ # Parallel (using multiprocessing)
71
+ results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
72
+ ```
73
+
74
+ ## Configuration Options
75
+
76
+ All configuration options are optional with sensible defaults:
77
+
78
+ ```python
79
+ config = CrawlConfig(
80
+     # Content Selection (all optional)
81
+     target_selector="article", # Specific element to target
82
+     remove_selectors=[ # Elements to remove
83
+         ".ads",
84
+         "#popup",
85
+         ".navigation",
86
+         ".footer"
87
+     ],
88
+     remove_selectors_regex=["modal-\\d+"], # Regex patterns for removal
89
+
90
+     # Processing Settings
91
+     max_concurrent_requests=1, # Default: 1; raise to crawl URLs in parallel
92
+     request_delay=0.5, # Delay between requests in seconds
93
+     timeout=30, # Request timeout in seconds
94
+
95
+     # Output Settings
96
+     output_dir="custom_output", # Default: "spiderforce_reports"
97
+     report_file="custom_report.json", # Default: "crawl_report.json"
98
+     webhook_url="https://your-webhook.com", # Optional webhook endpoint
99
+     webhook_timeout=10 # Webhook timeout in seconds
100
+ )
101
+ ```
102
+
103
+ ## Real-World Examples
104
+
105
+ ### 1. Basic Website Crawling
106
+
107
+ ```python
108
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
109
+ from pathlib import Path
110
+
111
+ spider = SpiderForce4AI("http://localhost:3004")
112
+ config = CrawlConfig(
113
+ output_dir=Path("blog_content")
114
+ )
115
+
116
+ result = spider.crawl_url("https://example.com/blog", config)
117
+ print(f"Content saved to: {result.url}.md")
118
+ ```
119
+
120
+ ### 2. Advanced Parallel Sitemap Crawling
121
+
122
+ ```python
123
+ config = CrawlConfig(
124
+     max_concurrent_requests=5,
125
+     output_dir=Path("website_content"),
126
+     remove_selectors=[
127
+         ".navigation",
128
+         ".footer",
129
+         ".ads",
130
+         "#cookie-notice"
131
+     ],
132
+     webhook_url="https://your-webhook.com/endpoint"
133
+ )
134
+
135
+ results = spider.crawl_sitemap_parallel(
136
+ "https://example.com/sitemap.xml",
137
+ config
138
+ )
139
+ ```
140
+
141
+ ### 3. Async Crawling with Progress
142
+
143
+ ```python
144
+ import asyncio
145
+
146
+ async def main():
147
+     config = CrawlConfig(
148
+         max_concurrent_requests=3,
149
+         request_delay=1.0
150
+     )
151
+
152
+     async with spider:
153
+         results = await spider.crawl_urls_async([
154
+             "https://example.com/1",
155
+             "https://example.com/2",
156
+             "https://example.com/3"
157
+         ], config)
158
+
159
+     return results
160
+
161
+ results = asyncio.run(main())
162
+ ```
163
+
164
+ ## Output Structure
165
+
166
+ ### 1. File Organization
167
+ ```
168
+ output_dir/
169
+ ├── example-com-page1.md
170
+ ├── example-com-page2.md
171
+ └── crawl_report.json
172
+ ```
173
+
174
+ ### 2. Markdown Files
175
+ Each markdown file is named using a slugified version of the URL and contains the converted content.
176
+
177
+ ### 3. Report JSON Structure
178
+ ```json
179
+ {
180
+ "timestamp": "2025-02-15T10:30:00.123456",
181
+ "config": {
182
+ "target_selector": "article",
183
+ "remove_selectors": [".ads", "#popup"],
184
+ "remove_selectors_regex": ["modal-\\d+"]
185
+ },
186
+ "results": {
187
+ "successful": [
188
+ {
189
+ "url": "https://example.com/page1",
190
+ "status": "success",
191
+ "markdown": "# Page Title\n\nContent...",
192
+ "timestamp": "2025-02-15T10:30:00.123456"
193
+ }
194
+ ],
195
+ "failed": [
196
+ {
197
+ "url": "https://example.com/page2",
198
+ "status": "failed",
199
+ "error": "HTTP 404: Not Found",
200
+ "timestamp": "2025-02-15T10:30:01.123456"
201
+ }
202
+ ]
203
+ },
204
+ "summary": {
205
+ "total": 2,
206
+ "successful": 1,
207
+ "failed": 1
208
+ }
209
+ }
210
+ ```
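+
+ Because the report is plain JSON, it can be inspected with the standard library alone. The snippet below is a minimal sketch that assumes the default output directory and report file name shown above; adjust the path if you changed output_dir or report_file.
+
+ ```python
+ # Sketch: summarize a finished crawl from the JSON report (stdlib only).
+ import json
+ from pathlib import Path
+
+ report_path = Path("spiderforce_reports") / "crawl_report.json"
+ report = json.loads(report_path.read_text())
+
+ summary = report["summary"]
+ print(f"Crawled {summary['total']} URLs: "
+       f"{summary['successful']} succeeded, {summary['failed']} failed")
+
+ # List the URLs that need a retry, with the recorded error message.
+ for entry in report["results"]["failed"]:
+     print(f"  {entry['url']}: {entry['error']}")
+ ```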
211
+
212
+ ### 4. Webhook Notifications
213
+ If a webhook_url is configured, the endpoint receives a JSON notification for each crawled URL:
214
+ ```json
215
+ {
216
+ "url": "https://example.com/page1",
217
+ "status": "success",
218
+ "markdown": "# Page Title\n\nContent...",
219
+ "timestamp": "2025-02-15T10:30:00.123456",
220
+ "config": {
221
+ "target_selector": "article",
222
+ "remove_selectors": [".ads", "#popup"]
223
+ }
224
+ }
225
+ ```
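+
+ Any HTTP endpoint that accepts a JSON POST body can act as the webhook target. The sketch below is a minimal stand-alone receiver built on Python's standard library; the printed fields come from the payload shown above, while the host, port, and handler itself are purely illustrative and not part of this package.
+
+ ```python
+ # Sketch: a minimal webhook receiver for crawl notifications (stdlib only).
+ import json
+ from http.server import BaseHTTPRequestHandler, HTTPServer
+
+ class CrawlNotificationHandler(BaseHTTPRequestHandler):
+     def do_POST(self):
+         length = int(self.headers.get("Content-Length", 0))
+         payload = json.loads(self.rfile.read(length))
+         # Each notification carries the crawled URL and its status.
+         print(f"{payload['status']}: {payload['url']}")
+         self.send_response(200)
+         self.end_headers()
+
+ if __name__ == "__main__":
+     # Point webhook_url at http://<this-host>:8000/ in CrawlConfig.
+     HTTPServer(("0.0.0.0", 8000), CrawlNotificationHandler).serve_forever()
+ ```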
226
+
227
+ ## Error Handling
228
+
229
+ The package handles various types of errors:
230
+ - Network errors
231
+ - Timeout errors
232
+ - Invalid URLs
233
+ - Missing content
234
+ - Service errors
235
+
236
+ All errors are:
237
+ 1. Logged to the console
238
+ 2. Included in the JSON report
239
+ 3. Sent via webhook (if configured)
240
+ 4. Available in the returned results list (see the sketch below)
241
+
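+ Since every crawl method returns one result per URL, failures can also be handled directly in code. This is a minimal sketch; the status and error attribute names are assumed to mirror the fields of the JSON report and may differ on the actual result objects.
+
+ ```python
+ # Sketch: separate failed URLs from successful ones after a crawl.
+ # Assumes result objects expose url, status, and error, mirroring the report.
+ results = spider.crawl_urls(urls, config)
+
+ failed = [r for r in results if r.status != "success"]
+ for r in failed:
+     print(f"Needs retry: {r.url} ({r.error})")
+ ```
+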
242
+ ## Requirements
243
+
244
+ - Python 3.11 or later
245
+ - Running SpiderForce4AI service
246
+ - Internet connection
247
+
248
+ ## License
249
+
250
+ MIT License
251
+
252
+ ## Credits
253
+
254
+ Created by [Peter Tam](https://petertam.pro)
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "spiderforce4ai"
7
- version = "0.1.5"
7
+ version = "0.1.6"
8
8
  description = "Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service"
9
9
  readme = "README.md"
10
10
  authors = [{name = "Piotr Tamulewicz", email = "pt@petertam.pro"}]
@@ -3,7 +3,7 @@ from setuptools import setup, find_packages
3
3
 
4
4
  setup(
5
5
  name="spiderforce4ai",
6
- version="0.1.5",
6
+ version="0.1.6",
7
7
  author="Piotr Tamulewicz",
8
8
  author_email="pt@petertam.pro",
9
9
  description="Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service",