wxpath 0.2.0.tar.gz → 0.4.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {wxpath-0.2.0/src/wxpath.egg-info → wxpath-0.4.0}/PKG-INFO +165 -42
- {wxpath-0.2.0 → wxpath-0.4.0}/README.md +156 -39
- {wxpath-0.2.0 → wxpath-0.4.0}/pyproject.toml +9 -5
- wxpath-0.4.0/src/wxpath/cli.py +137 -0
- wxpath-0.4.0/src/wxpath/core/ops.py +278 -0
- wxpath-0.4.0/src/wxpath/core/parser.py +598 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/runtime/engine.py +177 -48
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/runtime/helpers.py +0 -7
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/hooks/registry.py +29 -17
- wxpath-0.4.0/src/wxpath/http/client/cache.py +43 -0
- wxpath-0.4.0/src/wxpath/http/client/crawler.py +315 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/client/request.py +6 -3
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/client/response.py +1 -1
- wxpath-0.4.0/src/wxpath/http/policy/robots.py +82 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/stats.py +6 -0
- wxpath-0.4.0/src/wxpath/settings.py +108 -0
- {wxpath-0.2.0 → wxpath-0.4.0/src/wxpath.egg-info}/PKG-INFO +165 -42
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/SOURCES.txt +3 -1
- wxpath-0.4.0/src/wxpath.egg-info/requires.txt +20 -0
- wxpath-0.2.0/src/wxpath/cli.py +0 -52
- wxpath-0.2.0/src/wxpath/core/errors.py +0 -134
- wxpath-0.2.0/src/wxpath/core/ops.py +0 -244
- wxpath-0.2.0/src/wxpath/core/parser.py +0 -319
- wxpath-0.2.0/src/wxpath/http/client/crawler.py +0 -196
- wxpath-0.2.0/src/wxpath.egg-info/requires.txt +0 -11
- {wxpath-0.2.0 → wxpath-0.4.0}/LICENSE +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/setup.cfg +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/__init__.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/__init__.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/dom.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/models.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/runtime/__init__.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/hooks/__init__.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/hooks/builtin.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/__init__.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/client/__init__.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/policy/backoff.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/policy/retry.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/policy/throttler.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/patches.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/util/__init__.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/util/logging.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/util/serialize.py +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/dependency_links.txt +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/entry_points.txt +0 -0
- {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/top_level.txt +0 -0
@@ -1,16 +1,22 @@
 Metadata-Version: 2.4
 Name: wxpath
-Version: 0.2.0
+Version: 0.4.0
 Summary: wxpath - a declarative web crawler and data extractor
 Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
 License-Expression: MIT
-Requires-Python: >=3.
+Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: requests>=2.0
 Requires-Dist: lxml>=4.0
 Requires-Dist: elementpath<=5.0.3,>=5.0.0
 Requires-Dist: aiohttp<=3.12.15,>=3.8.0
+Requires-Dist: tqdm>=4.0.0
+Provides-Extra: cache
+Requires-Dist: aiohttp-client-cache>=0.14.0; extra == "cache"
+Provides-Extra: cache-sqlite
+Requires-Dist: aiohttp-client-cache[sqlite]; extra == "cache-sqlite"
+Provides-Extra: cache-redis
+Requires-Dist: aiohttp-client-cache[redis]; extra == "cache-redis"
 Provides-Extra: test
 Requires-Dist: pytest>=7.0; extra == "test"
 Requires-Dist: pytest-asyncio>=0.23; extra == "test"
@@ -18,12 +24,13 @@ Provides-Extra: dev
 Requires-Dist: ruff; extra == "dev"
 Dynamic: license-file
 
+# **wxpath** - declarative web crawling with XPath
 
-
+[](https://www.python.org/downloads/release/python-3100/)
 
-**wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression. **wxpath**
+**wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
-By introducing the `url(...)` operator and the `///` syntax,
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction.
 
 NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
 
@@ -31,19 +38,26 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 ## Contents
 
 - [Example](#example)
-- [
+- [Language Design](DESIGN.md)
+- [`url(...)` and `///url(...)` Explained](#url-and-url-explained)
 - [General flow](#general-flow)
 - [Asynchronous Crawling](#asynchronous-crawling)
+- [Polite Crawling](#polite-crawling)
 - [Output types](#output-types)
-- [XPath 3.1
+- [XPath 3.1](#xpath-31-by-default)
+- [Progress Bar](#progress-bar)
 - [CLI](#cli)
+- [Persistence and Caching](#persistence-and-caching)
+- [Settings](#settings)
 - [Hooks (Experimental)](#hooks-experimental)
 - [Install](#install)
-- [More Examples](
+- [More Examples](EXAMPLES.md)
 - [Comparisons](#comparisons)
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
+- [Commercial support / consulting](#commercial-support--consulting)
+- [Versioning](#versioning)
 - [License](#license)
 
 
@@ -51,34 +65,40 @@
 
 ```python
 import wxpath
+from wxpath.settings import CRAWLER_SETTINGS
 
-
+# Custom headers for politeness; necessary for some sites (e.g., Wikipedia)
+CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.com)'}
+
+# Crawl, extract fields, build a knowledge graph
+path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-
-
-
-
-
-
+    ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+    /map{
+        'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
+        'url': string(base-uri(.)),
+        'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
+        'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
+    }
 """
 
-for item in wxpath.wxpath_async_blocking_iter(
+for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
     print(item)
 ```
 
 Output:
 
 ```python
-map{'title':
-map{'title':
-map{'title':
-
-map{'title': TextNode('Data Analysis Expressions'), 'url': 'https://en.wikipedia.org/wiki/Data_Analysis_Expressions', 'short_description': TextNode('Formula and data query language')}
-map{'title': TextNode('Domain knowledge'), 'url': 'https://en.wikipedia.org/wiki/Domain_knowledge', 'short_description': TextNode('Specialist knowledge within a specific field')}
-map{'title': TextNode('Rights Expression Language'), 'url': 'https://en.wikipedia.org/wiki/Rights_Expression_Language', 'short_description': TextNode('Machine-processable language used to express intellectual property rights (such as copyright)')}
-map{'title': TextNode('Computer science'), 'url': 'https://en.wikipedia.org/wiki/Computer_science', 'short_description': TextNode('Study of computation')}
+map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
+map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
+map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
+...
 ```
 
+**Note:** Some sites (including Wikipedia) may block requests without proper headers.
+See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
+
+
 The above expression does the following:
 
 1. Starts at the specified URL, `https://en.wikipedia.org/wiki/Expression_language`.
@@ -92,18 +112,23 @@
 ## `url(...)` and `///url(...)` Explained
 
 - `url(...)` is a custom operator that fetches the content of the user-specified or internally generated URL and returns it as an `lxml.html.HtmlElement` for further XPath processing.
-- `///url(...)` indicates
+- `///url(...)` indicates a deep crawl. It tells the runtime engine to continue following links up to the specified `max_depth`. Unlike repeated `url()` hops, it allows a single expression to describe deeper graph exploration. WARNING: Use with caution and constraints (via `max_depth` or XPath predicates) to avoid traversal explosion.
+
+
+## Language Design
+
+See [DESIGN.md](DESIGN.md) for details of the language design: it walks through the core concepts and builds the language from the ground up.
 
 
 ## General flow
 
 **wxpath** evaluates an expression as a list of traversal and extraction steps (internally referred to as `Segment`s).
 
-`url(...)` creates crawl tasks either statically (via a fixed URL) or dynamically (via a URL derived from the XPath expression). **URLs are deduplicated globally,
+`url(...)` creates crawl tasks either statically (via a fixed URL) or dynamically (via a URL derived from the XPath expression). **URLs are deduplicated globally, on a best-effort basis - not per-depth**.
 
 XPath segments operate on fetched documents (fetched via the immediately preceding `url(...)` operations).
 
-`///url(...)` indicates
+`///url(...)` indicates deep crawling - it proceeds breadth-first-*ish* up to `max_depth`.
 
 Results are yielded as soon as they are ready.
 
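
To make the deep-crawl semantics above concrete, here is a minimal runnable sketch of a bounded `///url(...)` crawl, assembled from expression fragments that appear elsewhere in this README diff; the link predicate and `max_depth` are the constraints that keep the frontier finite:

```python
import wxpath

# Follow only /wiki/ article links (the predicate excludes namespaced pages
# such as File: or Help:), and stop expanding the frontier after one hop.
path_expr = (
    "url('https://en.wikipedia.org/wiki/Expression_language')"
    "///url(//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])"
    "//title/text()"
)

for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)
```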
@@ -128,7 +153,7 @@ asyncio.run(main())
 
 ### Blocking, Concurrent Requests
 
-**wxpath** also
+**wxpath** also provides an asyncio-in-sync API, allowing you to crawl multiple pages concurrently while maintaining the simplicity of synchronous code. This is particularly useful for crawls in strictly synchronous execution environments (i.e., not inside an `asyncio` event loop) where performance is a concern.
 
 ```python
 from wxpath import wxpath_async_blocking_iter
@@ -137,10 +162,14 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//@h
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1))
 ```
 
+## Polite Crawling
+
+**wxpath** respects [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) by default via the `WXPathEngine(..., robotstxt=True)` constructor.
+
 
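
As a sketch of the polite-crawling knob (assuming `WXPathEngine` is importable from the top-level package and accepts the `respect_robots` flag listed under "Advanced: Engine & Crawler Configuration" below), opting out for a host you control might look like:

```python
from wxpath import WXPathEngine, wxpath_async_blocking_iter

# robots.txt compliance is on by default; disable it only against servers you own.
engine = WXPathEngine(respect_robots=False)

items = list(wxpath_async_blocking_iter(
    "url('https://example.com')//title/text()",
    max_depth=0,
    engine=engine,
))
```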
 ## Output types
 
-The wxpath Python API yields structured objects
+The wxpath Python API yields structured objects.
 
 Depending on the expression, results may include:
 
@@ -181,6 +210,17 @@ path_expr = """
 # ...]
 ```
 
+## Progress Bar
+
+**wxpath** provides a progress bar (via `tqdm`) to track crawl progress. This is especially useful for long-running crawls.
+
+Enable it via `engine.run(..., progress=True)`, or pass `progress=True` to any of the `wxpath_async*(...)` functions.
+
+```python
+items = wxpath.wxpath_async_blocking("...", progress=True)
+> 100%|██████████████████████████████████████████████████████████▎| 469/471 [00:05<00:00, 72.00it/s, depth=2, yielded=457]
+```
+
 
 ## CLI
 
@@ -188,10 +228,11 @@
 
 The following example demonstrates how to crawl Wikipedia starting from the "Expression language" page, extract links to other wiki pages, and retrieve specific fields from each linked page.
 
-
+NOTE: Due to the ever-changing nature of web content, the output may vary over time.
 ```bash
-> wxpath --depth 1
-
+> wxpath --depth 1 \
+    --header "User-Agent: my-app/0.1 (contact: you@example.com)" \
+    "url('https://en.wikipedia.org/wiki/Expression_language') \
 ///url(//div[@id='mw-content-text']//a/@href[starts-with(., '/wiki/') \
 and not(matches(@href, '^(?:/wiki/)?(?:Wikipedia|File|Template|Special|Template_talk|Help):'))]) \
 /map{ \
@@ -212,6 +253,55 @@ WARNING: Due to the ever-changing nature of web content, the output may vary over
 {"title": "Computer science", "short_description": "Study of computation", "url": "https://en.wikipedia.org/wiki/Computer_science", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
 ```
 
+Command line options:
+
+```bash
+--depth <depth>                        Max crawl depth
+--verbose [true|false]                 Print high-level CLI information
+--debug [true|false]                   Print verbose runtime output and information
+--concurrency <concurrency>            Number of concurrent fetches
+--concurrency-per-host <concurrency>   Number of concurrent fetches per host
+--header "Key:Value"                   Add a custom header (e.g., 'Key:Value'). Can be used multiple times.
+--respect-robots [true|false]          (Default: true) Respect robots.txt
+--cache [true|false]                   (Default: false) Persist crawl results to a local database
+```
+
+
+## Persistence and Caching
+
+**wxpath** optionally persists crawl results to a local database. This is especially useful when you're crawling a large number of URLs and need to pause the crawl, change extraction expressions, or otherwise restart the crawl.
+
+**wxpath** supports two backends: sqlite and redis. SQLite is great for small-scale crawls with a single worker (i.e., `engine.crawler.concurrency == 1`). Redis is great for large-scale crawls with multiple workers. You will encounter a warning if `min(engine.crawler.concurrency, engine.crawler.per_host) > 1` when using the sqlite backend.
+
+To use persistence, install the appropriate optional dependency:
+
+```bash
+pip install wxpath[cache-sqlite]
+pip install wxpath[cache-redis]
+```
+
+Once the dependency is installed, enable the cache:
+
+```python
+from wxpath.settings import SETTINGS
+
+# To enable caching; sqlite is the default backend
+SETTINGS.http.client.cache.enabled = True
+
+# For the redis backend
+SETTINGS.http.client.cache.enabled = True
+SETTINGS.http.client.cache.backend = "redis"
+SETTINGS.http.client.cache.redis.address = "redis://localhost:6379/0"
+
+# Run wxpath as usual
+items = list(wxpath_async_blocking_iter('...', max_depth=1, engine=engine))
+```
+
+
+## Settings
+
+See [settings.py](src/wxpath/settings.py) for details of the available settings.
+
 
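
Putting the settings objects together: a short sketch combining the polite headers from the opening example with the sqlite cache shown above (attribute paths are the ones this README uses; anything beyond them should be checked against `src/wxpath/settings.py`):

```python
from wxpath import wxpath_async_blocking_iter
from wxpath.settings import SETTINGS, CRAWLER_SETTINGS

# Polite headers plus a persistent cache; sqlite is the default backend.
CRAWLER_SETTINGS.headers = {"User-Agent": "my-app/0.4.0 (contact: you@example.com)"}
SETTINGS.http.client.cache.enabled = True

# Re-running the same expression now reuses cached responses instead of refetching.
items = list(wxpath_async_blocking_iter("url('https://example.com')//title/text()", max_depth=0))
```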
 ## Hooks (Experimental)
 
@@ -257,10 +347,19 @@ hooks.register(hooks.JSONLWriter)
 
 ## Install
 
+Requires Python 3.10+.
+
 ```
 pip install wxpath
 ```
 
+For persistence/caching, wxpath supports the following backends:
+
+```
+pip install wxpath[cache-sqlite]
+pip install wxpath[cache-redis]
+```
+
 
 ## More Examples
 
@@ -285,13 +384,20 @@ crawler = Crawler(
     concurrency=8,
     per_host=2,
     timeout=10,
+    respect_robots=False,
+    headers={
+        "User-Agent": "my-app/0.1.0 (contact: you@example.com)",  # Sites like Wikipedia will appreciate this
+    },
 )
 
 # If `crawler` is not specified, a default Crawler will be created with
-# the provided concurrency and
+# the provided concurrency, per_host, and respect_robots values, or with defaults.
 engine = WXPathEngine(
-    # concurrency=16,
-    # per_host=8,
+    # concurrency: int = 16,
+    # per_host: int = 8,
+    # respect_robots: bool = True,
+    # allowed_response_codes: set[int] = {200},
+    # allow_redirects: bool = True,
     crawler=crawler,
 )
 
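
Read literally, the commented parameters above are `WXPathEngine`'s defaults; spelled out as a concrete call (values copied from those comments, so treat this as a sketch rather than canonical API documentation):

```python
# Equivalent to WXPathEngine() with no arguments, per the defaults listed above.
engine = WXPathEngine(
    concurrency=16,
    per_host=8,
    respect_robots=True,
    allowed_response_codes={200},
    allow_redirects=True,
)
items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
```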
@@ -305,33 +411,50 @@ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 
 ### Principles
 
-- Enable declarative,
+- Enable declarative crawling and scraping without boilerplate
 - Stay lightweight and composable
 - Asynchronous support for high-performance crawls
 
-###
+### Goals
 
 - URLs are deduplicated on a best-effort, per-crawl basis.
 - Crawls are intended to terminate once the frontier is exhausted or `max_depth` is reached.
 - Requests are performed concurrently.
 - Results are streamed as soon as they are available.
 
-###
+### Limitations (for now)
+
+The following features are not yet supported:
 
-- Strict result ordering
-- Persistent scheduling or crawl resumption
 - Automatic proxy rotation
 - Browser-based rendering (JavaScript execution)
+- Strict result ordering
 
 
 ## WARNINGS!!!
 
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
-
+- Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
 
+## Commercial support / consulting
+
+If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
+
+
+### Donate
+
+If you like wxpath and want to support its development, please consider [donating](https://www.paypal.com/donate/?business=WDNDK6J6PJEXY&no_recurring=0&item_name=Thanks+for+using+wxpath%21+Donations+fund+development%2C+docs%2C+and+bug+fixes.+If+wxpath+saved+you+time%2C+a+small+contribution+helps%21&currency_code=USD).
+
+
+## Versioning
+
+**wxpath** follows [semver](https://semver.org): `<MAJOR>.<MINOR>.<PATCH>`.
+
+However, pre-1.0.0 releases follow `0.<MAJOR>.<MINOR|PATCH>`.
+
 ## License
 
 MIT