wxpath 0.4.0.tar.gz → 0.4.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {wxpath-0.4.0/src/wxpath.egg-info → wxpath-0.4.1}/PKG-INFO +56 -15
- {wxpath-0.4.0 → wxpath-0.4.1}/README.md +55 -14
- {wxpath-0.4.0 → wxpath-0.4.1}/pyproject.toml +1 -1
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/engine.py +43 -5
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/crawler.py +2 -2
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/request.py +1 -1
- {wxpath-0.4.0 → wxpath-0.4.1/src/wxpath.egg-info}/PKG-INFO +56 -15
- {wxpath-0.4.0 → wxpath-0.4.1}/LICENSE +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/setup.cfg +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/cli.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/dom.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/models.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/ops.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/parser.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/core/runtime/helpers.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/builtin.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/hooks/registry.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/cache.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/client/response.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/backoff.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/retry.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/robots.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/policy/throttler.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/http/stats.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/patches.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/settings.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/__init__.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/logging.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath/util/serialize.py +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/SOURCES.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/dependency_links.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/entry_points.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/requires.txt +0 -0
- {wxpath-0.4.0 → wxpath-0.4.1}/src/wxpath.egg-info/top_level.txt +0 -0
### PKG-INFO

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: wxpath
-Version: 0.4.0
+Version: 0.4.1
 Summary: wxpath - a declarative web crawler and data extractor
 Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
 License-Expression: MIT
@@ -30,9 +30,42 @@ Dynamic: license-file
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
-
+This expression fetches a page, extracts links, and streams them concurrently - no crawl loop required:
 
-
+```python
+import wxpath
+
+expr = "url('https://example.com')//a/@href"
+
+for link in wxpath.wxpath_async_blocking_iter(expr):
+    print(link)
+```
+
+
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction:
+
+```python
+import wxpath
+
+path_expr = """
+url('https://quotes.toscrape.com')
+    ///url(//a/@href)
+    //a/@href
+"""
+
+for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
+    print(item)
+```
+
+
+## Why wxpath?
+
+Most web scrapers force you to write crawl control flow first, and extraction second.
+
+**wxpath** inverts that:
+- **You describe traversal declaratively**
+- **Extraction is expressed inline**
+- **The engine handles scheduling, concurrency, and deduplication**
 
 
 ## Contents
@@ -56,7 +89,7 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
-- [Commercial support](#commercial-support)
+- [Commercial support/consulting](#commercial-supportconsulting)
 - [Versioning](#versioning)
 - [License](#license)
 
@@ -73,7 +106,11 @@ CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.co
 # Crawl, extract fields, build a knowledge graph
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-    ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+    ///url(
+        //main//a/@href[
+            starts-with(., '/wiki/') and not(contains(., ':'))
+        ]
+    )
     /map{
         'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
         'url': string(base-uri(.)),
@@ -86,15 +123,6 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
     print(item)
 ```
 
-Output:
-
-```python
-map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
-map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
-map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
-...
-```
-
 **Note:** Some sites (including Wikipedia) may block requests without proper headers.
 See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
 
@@ -406,6 +434,17 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//mai
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
 
+### Runtime API (`wxpath_async*`) options
+
+- `max_depth`: int = 1
+- `progress`: bool = False
+- `engine`: WXPathEngine | None = None
+- `yield_errors`: bool = False
+
+
+### Settings
+You can also use [settings.py](src/wxpath/settings.py) to enable caching, throttling, concurrency and more.
+
 
 ## Project Philosophy
 
@@ -433,13 +472,15 @@ The following features are not yet supported:
 
 ## WARNINGS!!!
 
+This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
 - Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
 
-## Commercial support
+## Commercial support/consulting
 
 If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
 
````
### README.md

````diff
@@ -4,9 +4,42 @@
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
-
+This expression fetches a page, extracts links, and streams them concurrently - no crawl loop required:
 
-
+```python
+import wxpath
+
+expr = "url('https://example.com')//a/@href"
+
+for link in wxpath.wxpath_async_blocking_iter(expr):
+    print(link)
+```
+
+
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction:
+
+```python
+import wxpath
+
+path_expr = """
+url('https://quotes.toscrape.com')
+    ///url(//a/@href)
+    //a/@href
+"""
+
+for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
+    print(item)
+```
+
+
+## Why wxpath?
+
+Most web scrapers force you to write crawl control flow first, and extraction second.
+
+**wxpath** inverts that:
+- **You describe traversal declaratively**
+- **Extraction is expressed inline**
+- **The engine handles scheduling, concurrency, and deduplication**
 
 
 ## Contents
@@ -30,7 +63,7 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
-- [Commercial support](#commercial-support)
+- [Commercial support/consulting](#commercial-supportconsulting)
 - [Versioning](#versioning)
 - [License](#license)
 
@@ -47,7 +80,11 @@ CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/0.4.0 (contact: you@example.co
 # Crawl, extract fields, build a knowledge graph
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-    ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
+    ///url(
+        //main//a/@href[
+            starts-with(., '/wiki/') and not(contains(., ':'))
+        ]
+    )
     /map{
         'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
         'url': string(base-uri(.)),
@@ -60,15 +97,6 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
     print(item)
 ```
 
-Output:
-
-```python
-map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
-map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
-map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
-...
-```
-
 **Note:** Some sites (including Wikipedia) may block requests without proper headers.
 See [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration) to set a custom `User-Agent`.
 
@@ -380,6 +408,17 @@ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//mai
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
 
+### Runtime API (`wxpath_async*`) options
+
+- `max_depth`: int = 1
+- `progress`: bool = False
+- `engine`: WXPathEngine | None = None
+- `yield_errors`: bool = False
+
+
+### Settings
+You can also use [settings.py](src/wxpath/settings.py) to enable caching, throttling, concurrency and more.
+
 
 ## Project Philosophy
 
@@ -407,13 +446,15 @@ The following features are not yet supported:
 
 ## WARNINGS!!!
 
+This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
+
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
 - Deep crawls (`///`) require user discipline to avoid unbounded expansion (traversal explosion).
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
 
-## Commercial support
+## Commercial support/consulting
 
 If you want help building or operating crawlers/data feeds with wxpath (extraction, scheduling, monitoring, breakage fixes) or other web-scraping needs, please contact me at: rodrigopala91@gmail.com.
 
````
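Taken together, the README hunks above document four runtime options threaded through the `wxpath_async*` entry points. A minimal sketch of calling the blocking iterator with all four, based only on the signatures shown in this diff (the option values here are illustrative):

```python
import wxpath

# All four options from the new "Runtime API" section; the comments
# reflect only what this diff shows. yield_errors is new in 0.4.1.
for item in wxpath.wxpath_async_blocking_iter(
    "url('https://example.com')//a/@href",
    max_depth=1,        # bound on /// deep-crawl depth
    progress=False,     # progress reporting off (the default)
    engine=None,        # None -> a default WXPathEngine is constructed
    yield_errors=True,  # stream error records alongside normal results
):
    print(item)
```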
### src/wxpath/core/runtime/engine.py

```diff
@@ -162,7 +162,8 @@ class WXPathEngine(HookedEngineBase):
         self,
         expression: str,
         max_depth: int,
-        progress: bool = False
+        progress: bool = False,
+        yield_errors: bool = False,
     ) -> AsyncGenerator[Any, None]:
         """Execute a wxpath expression concurrently and yield results.
 
@@ -248,12 +249,32 @@ class WXPathEngine(HookedEngineBase):
 
             if task is None:
                 log.warning(f"Got unexpected response from {resp.request.url}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "unexpected_response",
+                        "status": resp.body,
+                        "body": resp.body
+                    }
+
                 if is_terminal():
                     break
                 continue
 
             if resp.error:
                 log.warning(f"Got error from {resp.request.url}: {resp.error}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "network_error",
+                        "exception": str(resp.error),
+                        "status": resp.status,
+                        "body": resp.body
+                    }
                 if is_terminal():
                     break
                 continue
@@ -261,6 +282,16 @@ class WXPathEngine(HookedEngineBase):
             # NOTE: Consider allowing redirects
             if resp.status not in self.allowed_response_codes or not resp.body:
                 log.warning(f"Got non-200 response from {resp.request.url}")
+
+                if yield_errors:
+                    yield {
+                        "__type__": "error",
+                        "url": resp.request.url,
+                        "reason": "bad_status",
+                        "status": resp.status,
+                        "body": resp.body
+                    }
+
                 if is_terminal():
                     break
                 continue
@@ -388,10 +419,12 @@ class WXPathEngine(HookedEngineBase):
 def wxpath_async(path_expr: str,
                  max_depth: int,
                  progress: bool = False,
-                 engine: WXPathEngine | None = None) -> AsyncGenerator[Any, None]:
+                 engine: WXPathEngine | None = None,
+                 yield_errors: bool = False
+                 ) -> AsyncGenerator[Any, None]:
     if engine is None:
         engine = WXPathEngine()
-    return engine.run(path_expr, max_depth, progress=progress)
+    return engine.run(path_expr, max_depth, progress=progress, yield_errors=yield_errors)
 
 
 ##### ASYNC IN SYNC #####
@@ -400,6 +433,7 @@ def wxpath_async_blocking_iter(
     max_depth: int = 1,
     progress: bool = False,
     engine: WXPathEngine | None = None,
+    yield_errors: bool = False
 ) -> Iterator[Any]:
     """Evaluate a wxpath expression using concurrent breadth-first traversal.
 
@@ -419,7 +453,8 @@
     """
     loop = asyncio.new_event_loop()
     asyncio.set_event_loop(loop)
-    agen = wxpath_async(path_expr, max_depth=max_depth, progress=progress, engine=engine)
+    agen = wxpath_async(path_expr, max_depth=max_depth, progress=progress,
+                        engine=engine, yield_errors=yield_errors)
 
     try:
         while True:
@@ -437,8 +472,11 @@ def wxpath_async_blocking(
     max_depth: int = 1,
     progress: bool = False,
     engine: WXPathEngine | None = None,
+    yield_errors: bool = False
 ) -> list[Any]:
     return list(wxpath_async_blocking_iter(path_expr,
                                            max_depth=max_depth,
                                            progress=progress,
-                                           engine=engine))
+                                           engine=engine,
+                                           yield_errors=yield_errors,
+                                           ))
```
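With `yield_errors=True`, the engine yields plain dicts tagged `"__type__": "error"` (reasons `unexpected_response`, `network_error`, and `bad_status` in the hunks above) in the same stream as ordinary results. A minimal consumption sketch, assuming only the dict shape visible in this diff:

```python
import wxpath

expr = "url('https://example.com')//a/@href"

for item in wxpath.wxpath_async_blocking_iter(expr, yield_errors=True):
    if isinstance(item, dict) and item.get("__type__") == "error":
        # Error records carry url/reason/status/body (plus exception
        # for network_error), per the engine.py hunks above.
        print(f"[{item['reason']}] {item['url']} status={item.get('status')}")
    else:
        print(item)  # a normal extraction result
```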
### src/wxpath/http/client/crawler.py

```diff
@@ -1,7 +1,7 @@
 import aiohttp
 
 try:
-    from aiohttp_client_cache import CachedSession
+    from aiohttp_client_cache import CachedSession
 except ImportError:
     CachedSession = None
 
@@ -42,7 +42,7 @@ def get_async_session(
     if timeout is None:
         timeout = aiohttp.ClientTimeout(total=CRAWLER_SETTINGS.timeout)
 
-    if CACHE_SETTINGS.enabled and CachedSession
+    if CACHE_SETTINGS.enabled and CachedSession:
         log.info("using aiohttp-client-cache")
         return CachedSession(
             cache=get_cache_backend(),
```
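The guarded import means caching degrades gracefully: `CachedSession` is `None` when `aiohttp-client-cache` is not installed, so the now-valid `if CACHE_SETTINGS.enabled and CachedSession:` gate falls through to a plain session. A hedged sketch of turning caching on - the import path is assumed from the package layout, not confirmed by this diff:

```python
# Assumed import path (the settings objects live in src/wxpath/settings.py);
# treat this as illustrative rather than a documented API.
from wxpath.settings import CACHE_SETTINGS

# Requires the optional dependency: pip install aiohttp-client-cache
CACHE_SETTINGS.enabled = True

# Subsequent calls to get_async_session() should then return a
# CachedSession backed by get_cache_backend(), per the hunk above.
```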
### src/wxpath.egg-info/PKG-INFO

Identical to the PKG-INFO diff shown above.
All remaining files (listed above with `+0 -0`) are unchanged.