spiderforce4ai 0.1.5__py3-none-any.whl → 0.1.6__py3-none-any.whl

@@ -0,0 +1,278 @@ spiderforce4ai-0.1.6.dist-info/METADATA
+ Metadata-Version: 2.2
+ Name: spiderforce4ai
+ Version: 0.1.6
+ Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
+ Home-page: https://petertam.pro
+ Author: Piotr Tamulewicz
+ Author-email: Piotr Tamulewicz <pt@petertam.pro>
+ License: MIT
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Requires-Python: >=3.11
+ Description-Content-Type: text/markdown
+ Requires-Dist: aiohttp>=3.8.0
+ Requires-Dist: asyncio>=3.4.3
+ Requires-Dist: rich>=10.0.0
+ Requires-Dist: aiofiles>=0.8.0
+ Requires-Dist: httpx>=0.24.0
+ Dynamic: author
+ Dynamic: home-page
+ Dynamic: requires-python
+
+ # SpiderForce4AI Python Wrapper
+
+ A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
+
+ ## Installation
+
+ ```bash
+ pip install spiderforce4ai
+ ```
+
+ ## Quick Start (Minimal Setup)
+
+ ```python
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
+
+ # Initialize with your SpiderForce4AI service URL
+ spider = SpiderForce4AI("http://localhost:3004")
+
+ # Use default configuration (will save in ./spiderforce_reports)
+ config = CrawlConfig()
+
+ # Crawl a single URL
+ result = spider.crawl_url("https://example.com", config)
+ ```
+
+ ## Crawling Methods
+
+ ### 1. Single URL Crawling
+
+ ```python
+ # Synchronous
+ result = spider.crawl_url("https://example.com", config)
+
+ # Asynchronous
+ async def crawl():
+     result = await spider.crawl_url_async("https://example.com", config)
+ ```
+
+ ### 2. Multiple URLs Crawling
+
+ ```python
+ # List of URLs
+ urls = [
+     "https://example.com/page1",
+     "https://example.com/page2",
+     "https://example.com/page3"
+ ]
+
+ # Synchronous
+ results = spider.crawl_urls(urls, config)
+
+ # Asynchronous
+ async def crawl():
+     results = await spider.crawl_urls_async(urls, config)
+
+ # Parallel (using multiprocessing)
+ results = spider.crawl_urls_parallel(urls, config)
+ ```
+
+ ### 3. Sitemap Crawling
+
+ ```python
+ # Synchronous
+ results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
+
+ # Asynchronous
+ async def crawl():
+     results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
+
+ # Parallel (using multiprocessing)
+ results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
+ ```
+
+ ## Configuration Options
+
+ All configuration options are optional with sensible defaults:
+
+ ```python
+ config = CrawlConfig(
+     # Content Selection (all optional)
+     target_selector="article",  # Specific element to target
+     remove_selectors=[  # Elements to remove
+         ".ads",
+         "#popup",
+         ".navigation",
+         ".footer"
+     ],
+     remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
+
+     # Processing Settings
+     max_concurrent_requests=1,  # Default: 1; raise to crawl in parallel
+     request_delay=0.5,  # Delay between requests in seconds
+     timeout=30,  # Request timeout in seconds
+
+     # Output Settings
+     output_dir="custom_output",  # Default: "spiderforce_reports"
+     report_file="custom_report.json",  # Default: "crawl_report.json"
+     webhook_url="https://your-webhook.com",  # Optional webhook endpoint
+     webhook_timeout=10  # Webhook timeout in seconds
+ )
+ ```
+
+ ## Real-World Examples
+
+ ### 1. Basic Website Crawling
+
+ ```python
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
+ from pathlib import Path
+
+ spider = SpiderForce4AI("http://localhost:3004")
+ config = CrawlConfig(
+     output_dir=Path("blog_content")
+ )
+
+ result = spider.crawl_url("https://example.com/blog", config)
+ print(f"Content saved to: {result.url}.md")
+ ```
+
+ ### 2. Advanced Parallel Sitemap Crawling
+
+ ```python
+ config = CrawlConfig(
+     max_concurrent_requests=5,
+     output_dir=Path("website_content"),
+     remove_selectors=[
+         ".navigation",
+         ".footer",
+         ".ads",
+         "#cookie-notice"
+     ],
+     webhook_url="https://your-webhook.com/endpoint"
+ )
+
+ results = spider.crawl_sitemap_parallel(
+     "https://example.com/sitemap.xml",
+     config
+ )
+ ```
+
+ ### 3. Async Crawling with Progress
+
+ ```python
+ import asyncio
+ from spiderforce4ai import SpiderForce4AI, CrawlConfig
+
+ spider = SpiderForce4AI("http://localhost:3004")
+
+ async def main():
+     config = CrawlConfig(
+         max_concurrent_requests=3,
+         request_delay=1.0
+     )
+
+     async with spider:
+         results = await spider.crawl_urls_async([
+             "https://example.com/1",
+             "https://example.com/2",
+             "https://example.com/3"
+         ], config)
+
+     return results
+
+ results = asyncio.run(main())
+ ```
+
+ ## Output Structure
+
+ ### 1. File Organization
+ ```
+ output_dir/
+ ├── example-com-page1.md
+ ├── example-com-page2.md
+ └── crawl_report.json
+ ```
+
+ ### 2. Markdown Files
+ Each markdown file is named using a slugified version of the URL and contains the converted content.
+
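The exact slugification rule is internal to the package; as a rough illustration of how a URL could map to a filename such as `example-com-page1.md`, something like the following would do (the `slugify_url` helper below is hypothetical, not the package's own function):

```python
import re

def slugify_url(url: str) -> str:
    # Hypothetical sketch: drop the scheme, then collapse every run of
    # non-alphanumeric characters into a single dash.
    slug = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", slug).strip("-").lower()
    return f"{slug}.md"

print(slugify_url("https://example.com/page1"))  # example-com-page1.md
```
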
+ ### 3. Report JSON Structure
+ ```json
+ {
+   "timestamp": "2025-02-15T10:30:00.123456",
+   "config": {
+     "target_selector": "article",
+     "remove_selectors": [".ads", "#popup"],
+     "remove_selectors_regex": ["modal-\\d+"]
+   },
+   "results": {
+     "successful": [
+       {
+         "url": "https://example.com/page1",
+         "status": "success",
+         "markdown": "# Page Title\n\nContent...",
+         "timestamp": "2025-02-15T10:30:00.123456"
+       }
+     ],
+     "failed": [
+       {
+         "url": "https://example.com/page2",
+         "status": "failed",
+         "error": "HTTP 404: Not Found",
+         "timestamp": "2025-02-15T10:30:01.123456"
+       }
+     ]
+   },
+   "summary": {
+     "total": 2,
+     "successful": 1,
+     "failed": 1
+   }
+ }
+ ```
+
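Because the report is plain JSON with the structure shown above, it can be inspected with the standard library alone. A minimal sketch, assuming the default `spiderforce_reports/crawl_report.json` location from the configuration section:

```python
import json
from pathlib import Path

# Load the report written at the end of a crawl.
report = json.loads(Path("spiderforce_reports/crawl_report.json").read_text())

summary = report["summary"]
print(f"Crawled {summary['total']} URLs: {summary['successful']} succeeded, {summary['failed']} failed")

# List the URLs that failed, with the recorded error message.
for item in report["results"]["failed"]:
    print(f"  {item['url']}: {item['error']}")
```
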
+ ### 4. Webhook Notifications
+ If configured, webhooks receive real-time updates in JSON format:
+ ```json
+ {
+   "url": "https://example.com/page1",
+   "status": "success",
+   "markdown": "# Page Title\n\nContent...",
+   "timestamp": "2025-02-15T10:30:00.123456",
+   "config": {
+     "target_selector": "article",
+     "remove_selectors": [".ads", "#popup"]
+   }
+ }
+ ```
+
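Any HTTP endpoint that accepts JSON POST requests can act as the webhook target. A minimal receiver sketch using Flask (Flask is an assumption here, not a dependency of this package, and the `/endpoint` path is arbitrary):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/endpoint", methods=["POST"])
def spiderforce_webhook():
    # Fields mirror the payload shown above: url, status, markdown, timestamp, config.
    payload = request.get_json(force=True)
    if payload.get("status") == "success":
        print(f"Converted {payload['url']} ({len(payload.get('markdown', ''))} characters)")
    else:
        print(f"Crawl failed for {payload['url']}")
    return {"received": True}

if __name__ == "__main__":
    app.run(port=8000)
```
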
+ ## Error Handling
+
+ The package handles various types of errors:
+ - Network errors
+ - Timeout errors
+ - Invalid URLs
+ - Missing content
+ - Service errors
+
+ All errors are:
+ 1. Logged in the console
+ 2. Included in the JSON report
+ 3. Sent via webhook (if configured)
+ 4. Available in the results list (see the sketch below)
+
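In practice this means a crawl call still returns when individual URLs fail, and failures can be filtered out of the returned results. A short sketch reusing the `spider`, `urls`, and `config` objects from the examples above, and assuming the `status`, `url`, and `error` attributes implied by the report format (attribute names are inferred, not confirmed against the package source):

```python
results = spider.crawl_urls(urls, config)

# Split the results by the reported status.
succeeded = [r for r in results if r.status == "success"]
failed = [r for r in results if r.status == "failed"]

for r in failed:
    print(f"Could not crawl {r.url}: {r.error}")

print(f"{len(succeeded)} of {len(results)} URLs converted to Markdown")
```
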
+ ## Requirements
+
+ - Python 3.11 or later
+ - Running SpiderForce4AI service
+ - Internet connection
+
+ ## License
+
+ MIT License
+
+ ## Credits
+
+ Created by [Peter Tam](https://petertam.pro)
@@ -0,0 +1,5 @@ spiderforce4ai-0.1.6.dist-info/RECORD
+ spiderforce4ai/__init__.py,sha256=i1lHYILqFG_Eld0ZCbBdK5F_Jk0zYr_60vS46AYZfTM,16496
+ spiderforce4ai-0.1.6.dist-info/METADATA,sha256=7rcL1OGqYeF1QHWUIB9xHaKYxGGegs2zHNz0UTu-ego,6575
+ spiderforce4ai-0.1.6.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
+ spiderforce4ai-0.1.6.dist-info/top_level.txt,sha256=Kth7A21Js7DCp0j5XBBi-FE45SCLouZkeNZU__Yr9Yk,15
+ spiderforce4ai-0.1.6.dist-info/RECORD,,
@@ -1,239 +0,0 @@ spiderforce4ai-0.1.5.dist-info/METADATA
- Metadata-Version: 2.2
- Name: spiderforce4ai
- Version: 0.1.5
- Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
- Home-page: https://petertam.pro
- Author: Piotr Tamulewicz
- Author-email: Piotr Tamulewicz <pt@petertam.pro>
- License: MIT
- Classifier: Development Status :: 4 - Beta
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Requires-Python: >=3.11
- Description-Content-Type: text/markdown
- Requires-Dist: aiohttp>=3.8.0
- Requires-Dist: asyncio>=3.4.3
- Requires-Dist: rich>=10.0.0
- Requires-Dist: aiofiles>=0.8.0
- Requires-Dist: httpx>=0.24.0
- Dynamic: author
- Dynamic: home-page
- Dynamic: requires-python
-
- # SpiderForce4AI Python Wrapper (Jina AI Reader, Firecrawl alternative)
-
- ## Features
-
- - 🔄 Simple synchronous and asynchronous APIs
- - 📁 Automatic Markdown file saving with URL-based filenames
- - 📊 Real-time progress tracking in console
- - 🪝 Webhook support for real-time notifications
- - 📝 Detailed crawl reports in JSON format
- - ⚡ Concurrent crawling with rate limiting
- - 🔍 Support for sitemap.xml crawling
- - 🛡️ Comprehensive error handling
-
- ## Installation
-
- ```bash
- pip install spiderforce4ai
- ```
-
- ## Quick Start
-
- ```python
- from spiderforce4ai import SpiderForce4AI, CrawlConfig
-
- # Initialize the client
- spider = SpiderForce4AI("http://localhost:3004")
-
- # Use default configuration
- config = CrawlConfig()
-
- # Crawl a single URL
- result = spider.crawl_url("https://example.com", config)
-
- # Crawl multiple URLs
- urls = [
-     "https://example.com/page1",
-     "https://example.com/page2"
- ]
- results = spider.crawl_urls(urls, config)
-
- # Crawl from sitemap
- results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
- ```
-
- ## Configuration
-
- The `CrawlConfig` class provides various configuration options. All parameters are optional with sensible defaults:
-
- ```python
- config = CrawlConfig(
-     # Content Selection (all optional)
-     target_selector="article",  # Specific element to target
-     remove_selectors=[".ads", "#popup"],  # Elements to remove
-     remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
-
-     # Processing Settings
-     max_concurrent_requests=1,  # Default: 1
-     request_delay=0.5,  # Delay between requests in seconds
-     timeout=30,  # Request timeout in seconds
-
-     # Output Settings
-     output_dir="spiderforce_reports",  # Default output directory
-     webhook_url="https://your-webhook.com",  # Optional webhook endpoint
-     webhook_timeout=10,  # Webhook timeout in seconds
-     report_file=None  # Optional custom report location
- )
- ```
-
- ```
94
-
95
- ### Default Directory Structure
96
-
97
- ```
98
- ./
99
- └── spiderforce_reports/
100
- ├── example-com-page1.md
101
- ├── example-com-page2.md
102
- └── crawl_report.json
103
- ```
104
-
105
- ## Webhook Notifications
106
-
107
- If `webhook_url` is configured, the crawler sends POST requests with the following JSON structure:
108
-
109
- ```json
110
- {
111
- "url": "https://example.com/page1",
112
- "status": "success",
113
- "markdown": "# Page Title\n\nContent...",
114
- "timestamp": "2025-02-15T10:30:00.123456",
115
- "config": {
116
- "target_selector": "article",
117
- "remove_selectors": [".ads", "#popup"],
118
- "remove_selectors_regex": ["modal-\\d+"]
119
- }
120
- }
121
- ```
122
-
123
- ## Crawl Report
124
-
125
- A comprehensive JSON report is automatically generated in the output directory:
126
-
127
- ```json
128
- {
129
- "timestamp": "2025-02-15T10:30:00.123456",
130
- "config": {
131
- "target_selector": "article",
132
- "remove_selectors": [".ads", "#popup"],
133
- "remove_selectors_regex": ["modal-\\d+"]
134
- },
135
- "results": {
136
- "successful": [
137
- {
138
- "url": "https://example.com/page1",
139
- "status": "success",
140
- "markdown": "# Page Title\n\nContent...",
141
- "timestamp": "2025-02-15T10:30:00.123456"
142
- }
143
- ],
144
- "failed": [
145
- {
146
- "url": "https://example.com/page2",
147
- "status": "failed",
148
- "error": "HTTP 404: Not Found",
149
- "timestamp": "2025-02-15T10:30:01.123456"
150
- }
151
- ]
152
- },
153
- "summary": {
154
- "total": 2,
155
- "successful": 1,
156
- "failed": 1
157
- }
158
- }
159
- ```
160
-
161
- ## Async Usage
-
- ```python
- import asyncio
- from spiderforce4ai import SpiderForce4AI, CrawlConfig
-
- async def main():
-     config = CrawlConfig()
-     spider = SpiderForce4AI("http://localhost:3004")
-
-     async with spider:
-         results = await spider.crawl_urls_async(
-             ["https://example.com/page1", "https://example.com/page2"],
-             config
-         )
-
-     return results
-
- if __name__ == "__main__":
-     results = asyncio.run(main())
- ```
-
- ## Error Handling
-
- The crawler is designed to be resilient:
- - Continues processing even if some URLs fail
- - Records all errors in the crawl report
- - Sends error notifications via webhook if configured
- - Provides clear error messages in console output
-
- ## Progress Tracking
-
- The crawler provides real-time progress tracking in the console:
-
- ```
- 🔄 Crawling URLs... [####################] 100%
- ✓ Successful: 95
- ✗ Failed: 5
- 📊 Report saved to: ./spiderforce_reports/crawl_report.json
- ```
-
- ## Usage with AI Agents
-
- The package is designed to be easily integrated with AI agents and chat systems:
-
- ```python
- from spiderforce4ai import SpiderForce4AI, CrawlConfig
-
- def fetch_content_for_ai(urls):
-     spider = SpiderForce4AI("http://localhost:3004")
-     config = CrawlConfig()
-
-     # Crawl content
-     results = spider.crawl_urls(urls, config)
-
-     # Return successful results
-     return {
-         result.url: result.markdown
-         for result in results
-         if result.status == "success"
-     }
-
- # Use with AI agent
- urls = ["https://example.com/article1", "https://example.com/article2"]
- content = fetch_content_for_ai(urls)
- ```
-
- ## Requirements
-
- - Python 3.11 or later
- - Docker (for running SpiderForce4AI service)
-
- ## License
-
- MIT License
-
- ## Credits
-
- Created by [Peter Tam](https://petertam.pro)
@@ -1,5 +0,0 @@ spiderforce4ai-0.1.5.dist-info/RECORD
- spiderforce4ai/__init__.py,sha256=i1lHYILqFG_Eld0ZCbBdK5F_Jk0zYr_60vS46AYZfTM,16496
- spiderforce4ai-0.1.5.dist-info/METADATA,sha256=Fm5H-qr4CBfJAVKXyJXsABYib_Vhvn2iUb6T6qSidHg,6214
- spiderforce4ai-0.1.5.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
- spiderforce4ai-0.1.5.dist-info/top_level.txt,sha256=Kth7A21Js7DCp0j5XBBi-FE45SCLouZkeNZU__Yr9Yk,15
- spiderforce4ai-0.1.5.dist-info/RECORD,,