wxpath 0.2.0.tar.gz → 0.4.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. {wxpath-0.2.0/src/wxpath.egg-info → wxpath-0.4.0}/PKG-INFO +165 -42
  2. {wxpath-0.2.0 → wxpath-0.4.0}/README.md +156 -39
  3. {wxpath-0.2.0 → wxpath-0.4.0}/pyproject.toml +9 -5
  4. wxpath-0.4.0/src/wxpath/cli.py +137 -0
  5. wxpath-0.4.0/src/wxpath/core/ops.py +278 -0
  6. wxpath-0.4.0/src/wxpath/core/parser.py +598 -0
  7. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/runtime/engine.py +177 -48
  8. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/runtime/helpers.py +0 -7
  9. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/hooks/registry.py +29 -17
  10. wxpath-0.4.0/src/wxpath/http/client/cache.py +43 -0
  11. wxpath-0.4.0/src/wxpath/http/client/crawler.py +315 -0
  12. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/client/request.py +6 -3
  13. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/client/response.py +1 -1
  14. wxpath-0.4.0/src/wxpath/http/policy/robots.py +82 -0
  15. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/stats.py +6 -0
  16. wxpath-0.4.0/src/wxpath/settings.py +108 -0
  17. {wxpath-0.2.0 → wxpath-0.4.0/src/wxpath.egg-info}/PKG-INFO +165 -42
  18. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/SOURCES.txt +3 -1
  19. wxpath-0.4.0/src/wxpath.egg-info/requires.txt +20 -0
  20. wxpath-0.2.0/src/wxpath/cli.py +0 -52
  21. wxpath-0.2.0/src/wxpath/core/errors.py +0 -134
  22. wxpath-0.2.0/src/wxpath/core/ops.py +0 -244
  23. wxpath-0.2.0/src/wxpath/core/parser.py +0 -319
  24. wxpath-0.2.0/src/wxpath/http/client/crawler.py +0 -196
  25. wxpath-0.2.0/src/wxpath.egg-info/requires.txt +0 -11
  26. {wxpath-0.2.0 → wxpath-0.4.0}/LICENSE +0 -0
  27. {wxpath-0.2.0 → wxpath-0.4.0}/setup.cfg +0 -0
  28. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/__init__.py +0 -0
  29. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/__init__.py +0 -0
  30. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/dom.py +0 -0
  31. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/models.py +0 -0
  32. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/core/runtime/__init__.py +0 -0
  33. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/hooks/__init__.py +0 -0
  34. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/hooks/builtin.py +0 -0
  35. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/__init__.py +0 -0
  36. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/client/__init__.py +0 -0
  37. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/policy/backoff.py +0 -0
  38. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/policy/retry.py +0 -0
  39. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/http/policy/throttler.py +0 -0
  40. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/patches.py +0 -0
  41. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/util/__init__.py +0 -0
  42. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/util/logging.py +0 -0
  43. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath/util/serialize.py +0 -0
  44. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/dependency_links.txt +0 -0
  45. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/entry_points.txt +0 -0
  46. {wxpath-0.2.0 → wxpath-0.4.0}/src/wxpath.egg-info/top_level.txt +0 -0
@@ -1,16 +1,22 @@
  Metadata-Version: 2.4
  Name: wxpath
- Version: 0.2.0
+ Version: 0.4.0
  Summary: wxpath - a declarative web crawler and data extractor
  Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
  License-Expression: MIT
- Requires-Python: >=3.9
+ Requires-Python: >=3.10
  Description-Content-Type: text/markdown
  License-File: LICENSE
- Requires-Dist: requests>=2.0
  Requires-Dist: lxml>=4.0
  Requires-Dist: elementpath<=5.0.3,>=5.0.0
  Requires-Dist: aiohttp<=3.12.15,>=3.8.0
+ Requires-Dist: tqdm>=4.0.0
+ Provides-Extra: cache
+ Requires-Dist: aiohttp-client-cache>=0.14.0; extra == "cache"
+ Provides-Extra: cache-sqlite
+ Requires-Dist: aiohttp-client-cache[sqlite]; extra == "cache-sqlite"
+ Provides-Extra: cache-redis
+ Requires-Dist: aiohttp-client-cache[redis]; extra == "cache-redis"
  Provides-Extra: test
  Requires-Dist: pytest>=7.0; extra == "test"
  Requires-Dist: pytest-asyncio>=0.23; extra == "test"
@@ -18,12 +24,13 @@ Provides-Extra: dev
  Requires-Dist: ruff; extra == "dev"
  Dynamic: license-file

+ # **wxpath** - declarative web crawling with XPath

- # wxpath - declarative web crawling with XPath
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)

- **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression. **wxpath** evaluates that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
+ **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.

- By introducing the `url(...)` operator and the `///` syntax, **wxpath**'s engine is able to perform deep, recursive web crawling and extraction.
+ By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction.

  NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).

@@ -31,19 +38,26 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
  ## Contents

  - [Example](#example)
- - [`url(...)` and `///url(...)` Explained](#url-and---explained)
+ - [Language Design](DESIGN.md)
+ - [`url(...)` and `///url(...)` Explained](#url-and-url-explained)
  - [General flow](#general-flow)
  - [Asynchronous Crawling](#asynchronous-crawling)
+ - [Polite Crawling](#polite-crawling)
  - [Output types](#output-types)
- - [XPath 3.1 support](#xpath-31-support)
+ - [XPath 3.1](#xpath-31-by-default)
+ - [Progress Bar](#progress-bar)
  - [CLI](#cli)
+ - [Persistence and Caching](#persistence-and-caching)
+ - [Settings](#settings)
  - [Hooks (Experimental)](#hooks-experimental)
  - [Install](#install)
- - [More Examples](#more-examples)
+ - [More Examples](EXAMPLES.md)
  - [Comparisons](#comparisons)
  - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
  - [Project Philosophy](#project-philosophy)
  - [Warnings](#warnings)
+ - [Commercial support / consulting](#commercial-support--consulting)
+ - [Versioning](#versioning)
  - [License](#license)


@@ -51,34 +65,40 @@ NOTE: This project is in early development. Core concepts are stable, but the AP

  ```python
  import wxpath
+ from wxpath.settings import CRAWLER_SETTINGS

- path = """
+ # Custom headers for politeness; necessary for some sites (e.g., Wikipedia)
+ CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.com)'}
+
+ # Crawl, extract fields, build a knowledge graph
+ path_expr = """
  url('https://en.wikipedia.org/wiki/Expression_language')
- ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
- /map{
-     'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
-     'url':string(base-uri(.)),
-     'short_description':(//div[contains(@class, 'shortdescription')]/text())[1]
- }
+ ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+ /map{
+     'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
+     'url': string(base-uri(.)),
+     'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
+     'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
+ }
  """

- for item in wxpath.wxpath_async_blocking_iter(path, max_depth=1):
+ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
      print(item)
  ```

  Output:

  ```python
- map{'title': TextNode('Computer language'), 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': TextNode('Formal language for communicating with a computer')}
- map{'title': TextNode('Machine-readable medium and data'), 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': TextNode('Medium capable of storing data in a format readable by a machine')}
- map{'title': TextNode('Advanced Boolean Expression Language'), 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': TextNode('Hardware description language and software')}
- map{'title': TextNode('Jakarta Expression Language'), 'url': 'https://en.wikipedia.org/wiki/Jakarta_Expression_Language', 'short_description': TextNode('Computer programming language')}
- map{'title': TextNode('Data Analysis Expressions'), 'url': 'https://en.wikipedia.org/wiki/Data_Analysis_Expressions', 'short_description': TextNode('Formula and data query language')}
- map{'title': TextNode('Domain knowledge'), 'url': 'https://en.wikipedia.org/wiki/Domain_knowledge', 'short_description': TextNode('Specialist knowledge within a specific field')}
- map{'title': TextNode('Rights Expression Language'), 'url': 'https://en.wikipedia.org/wiki/Rights_Expression_Language', 'short_description': TextNode('Machine-processable language used to express intellectual property rights (such as copyright)')}
- map{'title': TextNode('Computer science'), 'url': 'https://en.wikipedia.org/wiki/Computer_science', 'short_description': TextNode('Study of computation')}
+ map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
+ map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
+ map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
+ ...
  ```

+ **Note:** Some sites (including Wikipedia) may block requests without proper headers.
+ See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
+
+
  The above expression does the following:

  1. Starts at the specified URL, `https://en.wikipedia.org/wiki/Expression_language`.
@@ -92,18 +112,23 @@ The above expression does the following:
  ## `url(...)` and `///url(...)` Explained

  - `url(...)` is a custom operator that fetches the content of the user-specified or internally generated URL and returns it as an `lxml.html.HtmlElement` for further XPath processing.
- - `///url(...)` indicates infinite/recursive traversal. It tells **wxpath** to continue following links indefinitely, up to the specified `max_depth`. Unlike repeated `url()` hops, it allows a single expression to describe unbounded graph exploration. WARNING: Use with caution and constraints (via `max_depth` or XPath predicates) to avoid traversal explosion.
+ - `///url(...)` indicates a deep crawl. It tells the runtime engine to continue following links up to the specified `max_depth`. Unlike repeated `url()` hops, it allows a single expression to describe deeper graph exploration. WARNING: Use with caution and constraints (via `max_depth` or XPath predicates) to avoid traversal explosion.
+
+
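A minimal sketch of the constraint advice above (hypothetical site and link filter; the expression shape mirrors the README example earlier in this diff): a `///url(...)` crawl bounded by both an XPath predicate and `max_depth`.

```python
import wxpath

# Deep crawl limited two ways: the predicate only follows /docs/ links,
# and max_depth caps how far the frontier can expand.
path_expr = """
url('https://example.org/docs/')
///url(//a/@href[starts-with(., '/docs/')])
/map{
    'title': (//h1/text())[1] ! string(.),
    'url': string(base-uri(.))
}
"""

for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=2):
    print(item)
```
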
+ ## Language Design
+
+ See [DESIGN.md](DESIGN.md) for details of the language design. It introduces the core concepts and designs the language from the ground up.


  ## General flow

  **wxpath** evaluates an expression as a list of traversal and extraction steps (internally referred to as `Segment`s).

- `url(...)` creates crawl tasks either statically (via a fixed URL) or dynamically (via a URL derived from the XPath expression). **URLs are deduplicated globally, not per-depth and on a best-effort basis**.
+ `url(...)` creates crawl tasks either statically (via a fixed URL) or dynamically (via a URL derived from the XPath expression). **URLs are deduplicated globally, on a best-effort basis - not per-depth**.

  XPath segments operate on fetched documents (fetched via the immediately preceding `url(...)` operations).

- `///url(...)` indicates infinite/recursive traversal - it proceeds breadth-first-*ish* up to `max_depth`.
+ `///url(...)` indicates deep crawling - it proceeds breadth-first-*ish* up to `max_depth`.

  Results are yielded as soon as they are ready.
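As a conceptual aid (an annotated reading of the flow described here, not engine internals), the example expression from earlier in this diff breaks down into those steps roughly like this:

```python
# url('https://en.wikipedia.org/wiki/Expression_language')    <- static crawl task (the seed)
# ///url(//main//a/@href[starts-with(., '/wiki/') ...])       <- dynamic crawl tasks, deep, deduplicated globally
# /map{ 'title': ..., 'url': ..., 'short_description': ... }  <- extraction step run on each fetched document
```
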
@@ -128,7 +153,7 @@ asyncio.run(main())

  ### Blocking, Concurrent Requests

- **wxpath** also supports concurrent requests using an asyncio-in-sync pattern, allowing you to crawl multiple pages concurrently while maintaining the simplicity of synchronous code. This is particularly useful for crawls in strictly synchronous execution environments (i.e., not inside an `asyncio` event loop) where performance is a concern.
+ **wxpath** also provides an asyncio-in-sync API, allowing you to crawl multiple pages concurrently while maintaining the simplicity of synchronous code. This is particularly useful for crawls in strictly synchronous execution environments (i.e., not inside an `asyncio` event loop) where performance is a concern.

  ```python
  from wxpath import wxpath_async_blocking_iter
@@ -137,10 +162,14 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//@h
  items = list(wxpath_async_blocking_iter(path_expr, max_depth=1))
  ```

+ ## Polite Crawling
+
+ **wxpath** respects [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) by default via the `WXPathEngine(..., robotstxt=True)` constructor.
+

  ## Output types

- The wxpath Python API yields structured objects, not just strings.
+ The wxpath Python API yields structured objects.

  Depending on the expression, results may include:

@@ -181,6 +210,17 @@ path_expr = """
  # ...]
  ```

+ ## Progress Bar
+
+ **wxpath** provides a progress bar (via `tqdm`) to track crawl progress. This is especially useful for long-running crawls.
+
+ Enable by setting `engine.run(..., progress=True)`, or pass `progress=True` to any of the `wxpath_async*(...)` functions.
+
+ ```python
+ items = wxpath.wxpath_async_blocking("...", progress=True)
+ > 100%|██████████████████████████████████████████████████████████▎| 469/471 [00:05<00:00, 72.00it/s, depth=2, yielded=457]
+ ```
+

  ## CLI

@@ -188,10 +228,11 @@ path_expr = """

  The following example demonstrates how to crawl Wikipedia starting from the "Expression language" page, extract links to other wiki pages, and retrieve specific fields from each linked page.

- WARNING: Due to the everchanging nature of web content, the output may vary over time.
+ NOTE: Due to the ever-changing nature of web content, the output may vary over time.
  ```bash
- > wxpath --depth 1 "\
- url('https://en.wikipedia.org/wiki/Expression_language')\
+ > wxpath --depth 1 \
+     --header "User-Agent: my-app/0.1 (contact: you@example.com)" \
+     "url('https://en.wikipedia.org/wiki/Expression_language') \
  ///url(//div[@id='mw-content-text']//a/@href[starts-with(., '/wiki/') \
  and not(matches(@href, '^(?:/wiki/)?(?:Wikipedia|File|Template|Special|Template_talk|Help):'))]) \
  /map{ \
@@ -212,6 +253,55 @@ WARNING: Due to the everchanging nature of web content, the output may vary over

  {"title": "Computer science", "short_description": "Study of computation", "url": "https://en.wikipedia.org/wiki/Computer_science", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
  ```

+ Command line options:
+
+ ```bash
+ --depth <depth>                       Max crawl depth
+ --verbose [true|false]                Provides superficial CLI information
+ --debug [true|false]                  Provides verbose runtime output and information
+ --concurrency <concurrency>           Number of concurrent fetches
+ --concurrency-per-host <concurrency>  Number of concurrent fetches per host
+ --header "Key:Value"                  Add a custom header (e.g., 'Key:Value'). Can be used multiple times.
+ --respect-robots [true|false]         (Default: True) Respects robots.txt
+ --cache [true|false]                  (Default: False) Persist crawl results to a local database
+ ```
+
+
+ ## Persistence and Caching
+
+ **wxpath** optionally persists crawl results to a local database. This is especially useful when you're crawling a large number of URLs, and you decide to pause the crawl, change extraction expressions, or otherwise need to restart the crawl.
+
+ **wxpath** supports two backends: SQLite and Redis. SQLite is great for small-scale crawls with a single worker (i.e., `engine.crawler.concurrency == 1`). Redis is great for large-scale crawls with multiple workers. You will encounter a warning if `min(engine.crawler.concurrency, engine.crawler.per_host) > 1` when using the sqlite backend.
+
+ To use, you must install the appropriate optional dependency:
+
+ ```bash
+ pip install wxpath[cache-sqlite]
+ pip install wxpath[cache-redis]
+ ```
+
+ Once the dependency is installed, you must enable the cache:
+
+ ```python
+ from wxpath import wxpath_async_blocking_iter
+ from wxpath.settings import SETTINGS
+
+ # To enable caching; sqlite is the default
+ SETTINGS.http.client.cache.enabled = True
+
+ # For redis backend
+ SETTINGS.http.client.cache.enabled = True
+ SETTINGS.http.client.cache.backend = "redis"
+ SETTINGS.http.client.cache.redis.address = "redis://localhost:6379/0"
+
+ # Run wxpath as usual (engine here is an optional, pre-configured WXPathEngine;
+ # see Advanced: Engine & Crawler Configuration below)
+ items = list(wxpath_async_blocking_iter('...', max_depth=1, engine=engine))
+ ```
+
+
+ ## Settings
+
+ See [settings.py](src/wxpath/settings.py) for details of the settings.
+
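A minimal sketch of adjusting settings programmatically, using only names that appear elsewhere in this README (`CRAWLER_SETTINGS.headers` and the `SETTINGS.http.client.cache` options); the remaining fields are documented in settings.py itself.

```python
from wxpath.settings import SETTINGS, CRAWLER_SETTINGS

# Politeness headers sent with every request
CRAWLER_SETTINGS.headers = {"User-Agent": "my-app/0.4.0 (contact: you@example.com)"}

# Optional response cache (requires the cache-redis extra)
SETTINGS.http.client.cache.enabled = True
SETTINGS.http.client.cache.backend = "redis"
SETTINGS.http.client.cache.redis.address = "redis://localhost:6379/0"
```
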

  ## Hooks (Experimental)

@@ -257,10 +347,19 @@ hooks.register(hooks.JSONLWriter)

  ## Install

+ Requires Python 3.10+.
+
  ```
  pip install wxpath
  ```

+ For persistence/caching, wxpath supports the following backends:
+
+ ```
+ pip install wxpath[cache-sqlite]
+ pip install wxpath[cache-redis]
+ ```
+

  ## More Examples

@@ -285,13 +384,20 @@ crawler = Crawler(
      concurrency=8,
      per_host=2,
      timeout=10,
+     respect_robots=False,
+     headers={
+         "User-Agent": "my-app/0.1.0 (contact: you@example.com)",  # Sites like Wikipedia will appreciate this
+     },
  )

  # If `crawler` is not specified, a default Crawler will be created with
- # the provided concurrency and per_host values, or with defaults.
+ # the provided concurrency, per_host, and respect_robots values, or with defaults.
  engine = WXPathEngine(
-     # concurrency=16,
-     # per_host=8,
+     # concurrency: int = 16,
+     # per_host: int = 8,
+     # respect_robots: bool = True,
+     # allowed_response_codes: set[int] = {200},
+     # allow_redirects: bool = True,
      crawler=crawler,
  )

@@ -305,33 +411,50 @@ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))

  ### Principles

- - Enable declarative, recursive scraping without boilerplate
+ - Enable declarative crawling and scraping without boilerplate
  - Stay lightweight and composable
  - Asynchronous support for high-performance crawls

- ### Guarantees/Goals
+ ### Goals

  - URLs are deduplicated on a best-effort, per-crawl basis.
  - Crawls are intended to terminate once the frontier is exhausted or `max_depth` is reached.
  - Requests are performed concurrently.
  - Results are streamed as soon as they are available.

- ### Non-Goals/Limitations (for now)
+ ### Limitations (for now)
+
+ The following features are not yet supported:

- - Strict result ordering
- - Persistent scheduling or crawl resumption
  - Automatic proxy rotation
  - Browser-based rendering (JavaScript execution)
+ - Strict result ordering


  ## WARNINGS!!!

  - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
- - Recursive (`///`) crawls require user discipline to avoid unbounded expansion (traversal explosion).
+ - Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
  - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
  - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.


+ ## Commercial support / consulting
+
+ If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
+
+
+ ### Donate
+
+ If you like wxpath and want to support its development, please consider [donating](https://www.paypal.com/donate/?business=WDNDK6J6PJEXY&no_recurring=0&item_name=Thanks+for+using+wxpath%21+Donations+fund+development%2C+docs%2C+and+bug+fixes.+If+wxpath+saved+you+time%2C+a+small+contribution+helps%21&currency_code=USD).
+
+
+ ## Versioning
+
+ **wxpath** follows [semver](https://semver.org): `<MAJOR>.<MINOR>.<PATCH>`.
+
+ However, pre-1.0.0 follows `0.<MAJOR>.<MINOR|PATCH>`.
+
  ## License

  MIT