wxpath 0.4.0.tar.gz → 0.4.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {wxpath-0.4.0/src/wxpath.egg-info → wxpath-0.4.1}/PKG-INFO +56 -15
- {wxpath-0.4.0 → wxpath-0.4.1}/README.md +55 -14
- {wxpath-0.4.0 → wxpath-0.4.1}/pyproject.toml +1 -1
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/engine.py +43 -5
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/crawler.py +2 -2
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/request.py +1 -1
- {wxpath-0.4.0 → wxpath-0.4.1/src/wxpath.egg-info}/PKG-INFO +56 -15
- {wxpath-0.4.0 → wxpath-0.4.1}/LICENSE +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/setup.cfg +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/cli.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/dom.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/models.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/ops.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/parser.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/helpers.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/builtin.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/registry.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/cache.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/response.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/backoff.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/retry.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/robots.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/throttler.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/stats.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/patches.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/settings.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/logging.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/serialize.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/SOURCES.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/dependency_links.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/entry_points.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/requires.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/top_level.txt +0 -0
### PKG-INFO

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: wxpath
-Version: 0.4.0
+Version: 0.4.1
 Summary: wxpath - a declarative web crawler and data extractor
 Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
 License-Expression: MIT
@@ -30,9 +30,42 @@ Dynamic: license-file
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
-
+This expression fetches a page, extracts links, and streams them concurrently - no crawl loop required:
 
-
+```python
+import wxpath
+
+expr = "url('https://example.com')//a/@href"
+
+for link in wxpath.wxpath_async_blocking_iter(expr):
+    print(link)
+```
+
+
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction:
+
+```python
+import wxpath
+
+path_expr = """
+url('https://quotes.toscrape.com')
+    ///url(//a/@href)
+    //a/@href
+"""
+
+for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
+    print(item)
+```
+
+
+## Why wxpath?
+
+Most web scrapers force you to write crawl control flow first, and extraction second.
+
+**wxpath** inverts that:
+- **You describe traversal declaratively**
+- **Extraction is expressed inline**
+- **The engine handles scheduling, concurrency, and deduplication**
 
 
 ## Contents
@@ -56,7 +89,7 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
-- [Commercial support](#commercial-support)
+- [Commercial support/consulting](#commercial-supportconsulting)
 - [Versioning](#versioning)
 - [License](#license)
 
@@ -73,7 +106,11 @@ CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.co
 # Crawl, extract fields, build a knowledge graph
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-    ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+    ///url(
+        //main//a/@href[
+            starts-with(., '/wiki/') and not(contains(., ':'))
+        ]
+    )
     /map{
         'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
         'url': string(base-uri(.)),
@@ -86,15 +123,6 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
     print(item)
 ```
 
-Output:
-
-```python
-map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
-map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
-map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
-...
-```
-
 **Note:** Some sites (including Wikipedia) may block requests without proper headers.
 See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
 
@@ -406,6 +434,17 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//mai
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
 
+### Runtime API (`wxpath_async*`) options
+
+- `max_depth`: int = 1
+- `progress`: bool = False
+- `engine`: WXPathEngine | None = None
+- `yield_errors`: bool = False
+
+
+### Settings
+You can also use [settings.py](src/wxpath/settings.py) to enable caching, throttling, concurrency and more.
+
 
 ## Project Philosophy
 
@@ -433,13 +472,15 @@ The following features are not yet supported:
 
 ## WARNINGS!!!
 
+This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
 - Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
 
-## Commercial support
+## Commercial support/consulting
 
 If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
 
````
### README.md

````diff
@@ -4,9 +4,42 @@
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
-
+This expression fetches a page, extracts links, and streams them concurrently - no crawl loop required:
 
-
+```python
+import wxpath
+
+expr = "url('https://example.com')//a/@href"
+
+for link in wxpath.wxpath_async_blocking_iter(expr):
+    print(link)
+```
+
+
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction:
+
+```python
+import wxpath
+
+path_expr = """
+url('https://quotes.toscrape.com')
+    ///url(//a/@href)
+    //a/@href
+"""
+
+for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
+    print(item)
+```
+
+
+## Why wxpath?
+
+Most web scrapers force you to write crawl control flow first, and extraction second.
+
+**wxpath** inverts that:
+- **You describe traversal declaratively**
+- **Extraction is expressed inline**
+- **The engine handles scheduling, concurrency, and deduplication**
 
 
 ## Contents
@@ -30,7 +63,7 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
-- [Commercial support](#commercial-support)
+- [Commercial support/consulting](#commercial-supportconsulting)
 - [Versioning](#versioning)
 - [License](#license)
 
@@ -47,7 +80,11 @@ CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.co
 # Crawl, extract fields, build a knowledge graph
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-    ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+    ///url(
+        //main//a/@href[
+            starts-with(., '/wiki/') and not(contains(., ':'))
+        ]
+    )
     /map{
         'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
         'url': string(base-uri(.)),
@@ -60,15 +97,6 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
     print(item)
 ```
 
-Output:
-
-```python
-map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
-map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
-map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
-...
-```
-
 **Note:** Some sites (including Wikipedia) may block requests without proper headers.
 See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
 
@@ -380,6 +408,17 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//mai
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
 
+### Runtime API (`wxpath_async*`) options
+
+- `max_depth`: int = 1
+- `progress`: bool = False
+- `engine`: WXPathEngine | None = None
+- `yield_errors`: bool = False
+
+
+### Settings
+You can also use [settings.py](src/wxpath/settings.py) to enable caching, throttling, concurrency and more.
+
 
 ## Project Philosophy
 
@@ -407,13 +446,15 @@ The following features are not yet supported:
 
 ## WARNINGS!!!
 
+This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
 - Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
 
-## Commercial support
+## Commercial support/consulting
 
 If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
 
````
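Taken together, the README hunks above document four runtime options threaded through the `wxpath_async*` entry points. A minimal sketch of calling the blocking iterator with all four, based only on the signatures shown in this diff (the option values here are illustrative):

```python
import wxpath

# All four options from the new "Runtime API" section; the comments
# reflect only what this diff shows. yield_errors is new in 0.4.1.
for item in wxpath.wxpath_async_blocking_iter(
    "url('https://example.com')//a/@href",
    max_depth=1,        # bound on /// deep-crawl depth
    progress=False,     # progress reporting off (the default)
    engine=None,        # None -> a default WXPathEngine is constructed
    yield_errors=True,  # stream error records alongside normal results
):
    print(item)
```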
### src/wxpath/core/runtime/engine.py

```diff
@@ -162,7 +162,8 @@ class WXPathEngine(HookedEngineBase):
         self,
         expression: str,
         max_depth: int,
-        progress: bool = False
+        progress: bool = False,
+        yield_errors: bool = False,
     ) -> AsyncGenerator[Any, None]:
         """Execute a wxpath expression concurrently and yield results.
 
@@ -248,12 +249,32 @@ class WXPathEngine(HookedEngineBase):
 
             if task is None:
                 log.warning(f"Got unexpected response from {resp.request.url}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "unexpected_response",
+                        "status": resp.body,
+                        "body": resp.body
+                    }
+
                 if is_terminal():
                     break
                 continue
 
             if resp.error:
                 log.warning(f"Got error from {resp.request.url}: {resp.error}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "network_error",
+                        "exception": str(resp.error),
+                        "status": resp.status,
+                        "body": resp.body
+                    }
                 if is_terminal():
                     break
                 continue
@@ -261,6 +282,16 @@ class WXPathEngine(HookedEngineBase):
             # NOTE: Consider allowing redirects
             if resp.status not in self.allowed_response_codes or not resp.body:
                 log.warning(f"Got non-200 response from {resp.request.url}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "bad_status",
+                        "status": resp.status,
+                        "body": resp.body
+                    }
+
                 if is_terminal():
                     break
                 continue
@@ -388,10 +419,12 @@ class WXPathEngine(HookedEngineBase):
 def wxpath_async(path_expr: str,
                  max_depth: int,
                  progress: bool = False,
-                 engine: WXPathEngine | None = None) -> AsyncGenerator[Any, None]:
+                 engine: WXPathEngine | None = None,
+                 yield_errors: bool = False
+                 ) -> AsyncGenerator[Any, None]:
     if engine is None:
         engine = WXPathEngine()
-    return engine.run(path_expr, max_depth, progress=progress)
+    return engine.run(path_expr, max_depth, progress=progress, yield_errors=yield_errors)
 
 
 ##### ASYNC IN SYNC #####
@@ -400,6 +433,7 @@ def wxpath_async_blocking_iter(
     max_depth: int = 1,
     progress: bool = False,
     engine: WXPathEngine | None = None,
+    yield_errors: bool = False
 ) -> Iterator[Any]:
     """Evaluate a wxpath expression using concurrent breadth-first traversal.
 
@@ -419,7 +453,8 @@
     """
     loop = asyncio.new_event_loop()
     asyncio.set_event_loop(loop)
-    agen = wxpath_async(path_expr, max_depth=max_depth, progress=progress, engine=engine)
+    agen = wxpath_async(path_expr, max_depth=max_depth, progress=progress,
+                        engine=engine, yield_errors=yield_errors)
 
     try:
         while True:
@@ -437,8 +472,11 @@ def wxpath_async_blocking(
     max_depth: int = 1,
     progress: bool = False,
     engine: WXPathEngine | None = None,
+    yield_errors: bool = False
 ) -> list[Any]:
     return list(wxpath_async_blocking_iter(path_expr,
                                            max_depth=max_depth,
                                            progress=progress,
-                                           engine=engine))
+                                           engine=engine,
+                                           yield_errors=yield_errors,
+                                           ))
```
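With `yield_errors=True`, the engine yields plain dicts tagged `"__type__": "error"` (reasons `unexpected_response`, `network_error`, and `bad_status` in the hunks above) in the same stream as ordinary results. A minimal consumption sketch, assuming only the dict shape visible in this diff:

```python
import wxpath

expr = "url('https://example.com')//a/@href"

for item in wxpath.wxpath_async_blocking_iter(expr, yield_errors=True):
    if isinstance(item, dict) and item.get("__type__") == "error":
        # Error records carry url/reason/status/body (plus exception
        # for network_error), per the engine.py hunks above.
        print(f"[{item['reason']}] {item['url']} status={item.get('status')}")
    else:
        print(item)  # a normal extraction result
```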
### src/wxpath/http/client/crawler.py

```diff
@@ -1,7 +1,7 @@
 import aiohttp
 
 try:
-    from aiohttp_client_cache import CachedSession
+    from aiohttp_client_cache import CachedSession
 except ImportError:
     CachedSession = None
 
@@ -42,7 +42,7 @@ def get_async_session(
     if timeout is None:
         timeout = aiohttp.ClientTimeout(total=CRAWLER_SETTINGS.timeout)
 
-    if CACHE_SETTINGS.enabled and CachedSession
+    if CACHE_SETTINGS.enabled and CachedSession:
         log.info("using aiohttp-client-cache")
         return CachedSession(
             cache=get_cache_backend(),
```
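The guarded import means caching degrades gracefully: `CachedSession` is `None` when `aiohttp-client-cache` is not installed, so the now-valid `if CACHE_SETTINGS.enabled and CachedSession:` gate falls through to a plain session. A hedged sketch of turning caching on - the import path is assumed from the package layout, not confirmed by this diff:

```python
# Assumed import path (the settings objects live in src/wxpath/settings.py);
# treat this as illustrative rather than a documented API.
from wxpath.settings import CACHE_SETTINGS

# Requires the optional dependency: pip install aiohttp-client-cache
CACHE_SETTINGS.enabled = True

# Subsequent calls to get_async_session() should then return a
# CachedSession backed by get_cache_backend(), per the hunk above.
```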
### src/wxpath.egg-info/PKG-INFO

Identical to the PKG-INFO diff shown above.
All remaining files (listed above with `+0 -0`) are unchanged.