wxpath 0.4.0__tar.gz → 0.4.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. {wxpath-0.4.0/src/wxpath.egg-info → wxpath-0.4.1}/PKG-INFO +56 -15
  2. {wxpath-0.4.0 → wxpath-0.4.1}/README.md +55 -14
  3. {wxpath-0.4.0 → wxpath-0.4.1}/pyproject.toml +1 -1
  4. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/engine.py +43 -5
  5. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/crawler.py +2 -2
  6. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/request.py +1 -1
  7. {wxpath-0.4.0 → wxpath-0.4.1/src/wxpath.egg-info}/PKG-INFO +56 -15
  8. {wxpath-0.4.0 → wxpath-0.4.1}/LICENSE +0 -0
  9. {wxpath-0.4.0 → wxpath-0.4.1}/setup.cfg +0 -0
  10. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/__init__.py +0 -0
  11. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/cli.py +0 -0
  12. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/__init__.py +0 -0
  13. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/dom.py +0 -0
  14. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/models.py +0 -0
  15. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/ops.py +0 -0
  16. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/parser.py +0 -0
  17. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/__init__.py +0 -0
  18. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/helpers.py +0 -0
  19. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/__init__.py +0 -0
  20. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/builtin.py +0 -0
  21. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/registry.py +0 -0
  22. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/__init__.py +0 -0
  23. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/__init__.py +0 -0
  24. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/cache.py +0 -0
  25. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/response.py +0 -0
  26. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/backoff.py +0 -0
  27. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/retry.py +0 -0
  28. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/robots.py +0 -0
  29. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/throttler.py +0 -0
  30. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/stats.py +0 -0
  31. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/patches.py +0 -0
  32. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/settings.py +0 -0
  33. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/__init__.py +0 -0
  34. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/logging.py +0 -0
  35. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/serialize.py +0 -0
  36. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/SOURCES.txt +0 -0
  37. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/dependency_links.txt +0 -0
  38. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/entry_points.txt +0 -0
  39. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/requires.txt +0 -0
  40. {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/top_level.txt +0 -0
--- wxpath-0.4.0/src/wxpath.egg-info/PKG-INFO
+++ wxpath-0.4.1/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: wxpath
-Version: 0.4.0
+Version: 0.4.1
 Summary: wxpath - a declarative web crawler and data extractor
 Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
 License-Expression: MIT
@@ -30,9 +30,42 @@ Dynamic: license-file
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
-By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction.
+This expression fetches a page, extracts links, and streams them concurrently - no crawl loop required:
 
-NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+```python
+import wxpath
+
+expr = "url('https://example.com')//a/@href"
+
+for link in wxpath.wxpath_async_blocking_iter(expr):
+    print(link)
+```
+
+
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction:
+
+```python
+import wxpath
+
+path_expr = """
+url('https://quotes.toscrape.com')
+    ///url(//a/@href)
+    //a/@href
+"""
+
+for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
+    print(item)
+```
+
+
+## Why wxpath?
+
+Most web scrapers force you to write crawl control flow first, and extraction second.
+
+**wxpath** inverts that:
+- **You describe traversal declaratively**
+- **Extraction is expressed inline**
+- **The engine handles scheduling, concurrency, and deduplication**
 
 
 ## Contents
@@ -56,7 +89,7 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
-- [Commercial support / consulting](#commercial-support--consulting)
+- [Commercial support/consulting](#commercial-supportconsulting)
 - [Versioning](#versioning)
 - [License](#license)
 
@@ -73,7 +106,11 @@ CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.co
 # Crawl, extract fields, build a knowledge graph
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-    ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+    ///url(
+        //main//a/@href[
+            starts-with(., '/wiki/') and not(contains(., ':'))
+        ]
+    )
     /map{
         'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
         'url': string(base-uri(.)),
@@ -86,15 +123,6 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
     print(item)
 ```
 
-Output:
-
-```python
-map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
-map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
-map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
-...
-```
-
 **Note:** Some sites (including Wikipedia) may block requests without proper headers.
 See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
 
@@ -406,6 +434,17 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//mai
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
 
+### Runtime API (`wxpath_async*`) options
+
+- `max_depth`: int = 1
+- `progress`: bool = False
+- `engine`: WXPathEngine | None = None
+- `yield_errors`: bool = False
+
+
+### Settings
+You can also use [settings.py](src/wxpath/settings.py) to enable caching, throttling, concurrency and more.
+
 
 ## Project Philosophy
 
@@ -433,13 +472,15 @@ The following features are not yet supported:
 
 ## WARNINGS!!!
 
+This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
 - Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
 
-## Commercial support / consulting
+## Commercial support/consulting
 
 If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
 
--- wxpath-0.4.0/README.md
+++ wxpath-0.4.1/README.md
@@ -4,9 +4,42 @@
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
-By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction.
+This expression fetches a page, extracts links, and streams them concurrently - no crawl loop required:
 
-NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+```python
+import wxpath
+
+expr = "url('https://example.com')//a/@href"
+
+for link in wxpath.wxpath_async_blocking_iter(expr):
+    print(link)
+```
+
+
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction:
+
+```python
+import wxpath
+
+path_expr = """
+url('https://quotes.toscrape.com')
+    ///url(//a/@href)
+    //a/@href
+"""
+
+for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
+    print(item)
+```
+
+
+## Why wxpath?
+
+Most web scrapers force you to write crawl control flow first, and extraction second.
+
+**wxpath** inverts that:
+- **You describe traversal declaratively**
+- **Extraction is expressed inline**
+- **The engine handles scheduling, concurrency, and deduplication**
 
 
 ## Contents
@@ -30,7 +63,7 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
-- [Commercial support / consulting](#commercial-support--consulting)
+- [Commercial support/consulting](#commercial-supportconsulting)
 - [Versioning](#versioning)
 - [License](#license)
 
@@ -47,7 +80,11 @@ CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.co
 # Crawl, extract fields, build a knowledge graph
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-    ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+    ///url(
+        //main//a/@href[
+            starts-with(., '/wiki/') and not(contains(., ':'))
+        ]
+    )
     /map{
         'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
         'url': string(base-uri(.)),
@@ -60,15 +97,6 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
     print(item)
 ```
 
-Output:
-
-```python
-map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
-map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
-map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
-...
-```
-
 **Note:** Some sites (including Wikipedia) may block requests without proper headers.
 See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
 
@@ -380,6 +408,17 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//mai
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
 
+### Runtime API (`wxpath_async*`) options
+
+- `max_depth`: int = 1
+- `progress`: bool = False
+- `engine`: WXPathEngine | None = None
+- `yield_errors`: bool = False
+
+
+### Settings
+You can also use [settings.py](src/wxpath/settings.py) to enable caching, throttling, concurrency and more.
+
 
 ## Project Philosophy
 
@@ -407,13 +446,15 @@ The following features are not yet supported:
 
 ## WARNINGS!!!
 
+This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
 - Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
 
-## Commercial support / consulting
+## Commercial support/consulting
 
 If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
 
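The "Runtime API (`wxpath_async*`) options" section added above lists the four knobs without showing them together. Here is a minimal sketch of how they combine, reusing the README's own quotes.toscrape.com expression; the option names and defaults come straight from the diff, while the `WXPathEngine` import path is an assumption based on this package's file layout (`src/wxpath/core/runtime/engine.py`):

```python
import wxpath
# Assumed import path, inferred from the file list at the top of this diff.
from wxpath.core.runtime.engine import WXPathEngine

path_expr = """
url('https://quotes.toscrape.com')
    ///url(//a/@href)
    //a/@href
"""

# All four runtime options from the new README section, overridden from their defaults.
for item in wxpath.wxpath_async_blocking_iter(
    path_expr,
    max_depth=2,            # follow /// one hop deeper than the default of 1
    progress=True,          # enable progress reporting
    engine=WXPathEngine(),  # bring your own configured engine
    yield_errors=True,      # new in 0.4.1: stream error dicts alongside results
):
    print(item)
```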
--- wxpath-0.4.0/pyproject.toml
+++ wxpath-0.4.1/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "wxpath"
-version = "0.4.0"
+version = "0.4.1"
 description = "wxpath - a declarative web crawler and data extractor"
 readme = "README.md"
 requires-python = ">=3.10"
--- wxpath-0.4.0/src/wxpath/core/runtime/engine.py
+++ wxpath-0.4.1/src/wxpath/core/runtime/engine.py
@@ -162,7 +162,8 @@ class WXPathEngine(HookedEngineBase):
         self,
         expression: str,
         max_depth: int,
-        progress: bool = False
+        progress: bool = False,
+        yield_errors: bool = False,
     ) -> AsyncGenerator[Any, None]:
         """Execute a wxpath expression concurrently and yield results.
 
@@ -248,12 +249,32 @@
 
             if task is None:
                 log.warning(f"Got unexpected response from {resp.request.url}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "unexpected_response",
+                        "status": resp.status,
+                        "body": resp.body
+                    }
+
                 if is_terminal():
                     break
                 continue
 
             if resp.error:
                 log.warning(f"Got error from {resp.request.url}: {resp.error}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "network_error",
+                        "exception": str(resp.error),
+                        "status": resp.status,
+                        "body": resp.body
+                    }
                 if is_terminal():
                     break
                 continue
@@ -261,6 +282,16 @@
             # NOTE: Consider allowing redirects
             if resp.status not in self.allowed_response_codes or not resp.body:
                 log.warning(f"Got non-200 response from {resp.request.url}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "bad_status",
+                        "status": resp.status,
+                        "body": resp.body
+                    }
+
                 if is_terminal():
                     break
                 continue
@@ -388,10 +419,12 @@
 def wxpath_async(path_expr: str,
                  max_depth: int,
                  progress: bool = False,
-                 engine: WXPathEngine | None = None) -> AsyncGenerator[Any, None]:
+                 engine: WXPathEngine | None = None,
+                 yield_errors: bool = False
+                 ) -> AsyncGenerator[Any, None]:
     if engine is None:
         engine = WXPathEngine()
-    return engine.run(path_expr, max_depth, progress=progress)
+    return engine.run(path_expr, max_depth, progress=progress, yield_errors=yield_errors)
 
 
 ##### ASYNC IN SYNC #####
@@ -400,6 +433,7 @@ def wxpath_async_blocking_iter(
     max_depth: int = 1,
     progress: bool = False,
     engine: WXPathEngine | None = None,
+    yield_errors: bool = False
 ) -> Iterator[Any]:
     """Evaluate a wxpath expression using concurrent breadth-first traversal.
 
@@ -419,7 +453,8 @@
     """
     loop = asyncio.new_event_loop()
     asyncio.set_event_loop(loop)
-    agen = wxpath_async(path_expr, max_depth=max_depth, progress=progress, engine=engine)
+    agen = wxpath_async(path_expr, max_depth=max_depth, progress=progress,
+                        engine=engine, yield_errors=yield_errors)
 
     try:
         while True:
@@ -437,8 +472,11 @@ def wxpath_async_blocking(
     max_depth: int = 1,
     progress: bool = False,
     engine: WXPathEngine | None = None,
+    yield_errors: bool = False
 ) -> list[Any]:
     return list(wxpath_async_blocking_iter(path_expr,
                                            max_depth=max_depth,
                                            progress=progress,
-                                           engine=engine))
+                                           engine=engine,
+                                           yield_errors=yield_errors,
+                                           ))
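All three `yield_errors` branches added above emit plain dicts tagged `"__type__": "error"` into the same stream as ordinary results. A minimal consumer sketch that separates the two; the key names are taken directly from the hunks above, and the URL is a placeholder:

```python
import wxpath

results, errors = [], []
for item in wxpath.wxpath_async_blocking_iter(
    "url('https://example.com')//a/@href",
    max_depth=1,
    yield_errors=True,
):
    # Error payloads carry "__type__": "error" plus url, reason, status, and body;
    # reason is one of "unexpected_response", "network_error", or "bad_status".
    if isinstance(item, dict) and item.get("__type__") == "error":
        errors.append(item)
    else:
        results.append(item)

print(f"{len(results)} results, {len(errors)} errors")
for err in errors:
    print(err["reason"], err["url"])
```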
--- wxpath-0.4.0/src/wxpath/http/client/crawler.py
+++ wxpath-0.4.1/src/wxpath/http/client/crawler.py
@@ -1,7 +1,7 @@
 import aiohttp
 
 try:
-    from aiohttp_client_cache import CachedSession, SQLiteBackend
+    from aiohttp_client_cache import CachedSession
 except ImportError:
     CachedSession = None
 
@@ -42,7 +42,7 @@ def get_async_session(
     if timeout is None:
         timeout = aiohttp.ClientTimeout(total=CRAWLER_SETTINGS.timeout)
 
-    if CACHE_SETTINGS.enabled and CachedSession and SQLiteBackend:
+    if CACHE_SETTINGS.enabled and CachedSession:
        log.info("using aiohttp-client-cache")
        return CachedSession(
            cache=get_cache_backend(),
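With `SQLiteBackend` dropped from the import guard, the only requirement for caching is that `CachedSession` imports; the concrete backend is resolved by `get_cache_backend()`. A hedged sketch of switching the cache on - `CACHE_SETTINGS.enabled` appears in the hunk above, but the `wxpath.settings` import path is an assumption based on the `settings.py` file in the list at the top:

```python
import wxpath
from wxpath.settings import CACHE_SETTINGS  # assumed import path

# When enabled and aiohttp-client-cache is installed, get_async_session()
# returns a CachedSession; otherwise it silently falls back to plain aiohttp.
CACHE_SETTINGS.enabled = True

for link in wxpath.wxpath_async_blocking_iter("url('https://example.com')//a/@href"):
    print(link)
```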
--- wxpath-0.4.0/src/wxpath/http/client/request.py
+++ wxpath-0.4.1/src/wxpath/http/client/request.py
@@ -9,7 +9,7 @@ class Request:
     url: str
     method: str = "GET"
     headers: dict[str, str] = field(default_factory=dict)
-    timeout: float = 15.0
+    timeout: float | None = None
 
     retries: int = 0
     max_retries: int | None = None
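Defaulting `Request.timeout` to `None` instead of a hard-coded `15.0` lets a request that does not set its own timeout inherit the crawler-wide value - `crawler.py` above already resolves `None` via `CRAWLER_SETTINGS.timeout`. A sketch of the resolution order this enables; `resolve_timeout` is a hypothetical helper for illustration, not part of the package:

```python
import aiohttp

FALLBACK_TIMEOUT = 15.0  # the old hard-coded Request default

def resolve_timeout(
    request_timeout: float | None,
    settings_timeout: float | None,
) -> aiohttp.ClientTimeout:
    """Hypothetical helper: a per-request timeout wins, then the
    crawler-wide setting, then a last-resort constant."""
    if request_timeout is not None:
        return aiohttp.ClientTimeout(total=request_timeout)
    if settings_timeout is not None:
        return aiohttp.ClientTimeout(total=settings_timeout)
    return aiohttp.ClientTimeout(total=FALLBACK_TIMEOUT)

# A request created with the new default defers to settings:
# Request(url="https://example.com")  ->  timeout=None  ->  CRAWLER_SETTINGS.timeout
```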
src/wxpath.egg-info/PKG-INFO: identical to the PKG-INFO diff shown at the top of this page.