wxpath 0.1.0__tar.gz → 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {wxpath-0.1.0/src/wxpath.egg-info → wxpath-0.2.0}/PKG-INFO +30 -97
- {wxpath-0.1.0 → wxpath-0.2.0}/README.md +25 -94
- {wxpath-0.1.0 → wxpath-0.2.0}/pyproject.toml +23 -5
- {wxpath-0.1.0 → wxpath-0.2.0}/src/wxpath/cli.py +13 -28
- wxpath-0.2.0/src/wxpath/core/__init__.py +13 -0
- wxpath-0.2.0/src/wxpath/core/dom.py +22 -0
- wxpath-0.2.0/src/wxpath/core/errors.py +134 -0
- wxpath-0.2.0/src/wxpath/core/models.py +74 -0
- wxpath-0.2.0/src/wxpath/core/ops.py +244 -0
- wxpath-0.2.0/src/wxpath/core/parser.py +319 -0
- wxpath-0.2.0/src/wxpath/core/runtime/__init__.py +5 -0
- wxpath-0.2.0/src/wxpath/core/runtime/engine.py +315 -0
- wxpath-0.2.0/src/wxpath/core/runtime/helpers.py +48 -0
- wxpath-0.2.0/src/wxpath/hooks/__init__.py +9 -0
- wxpath-0.2.0/src/wxpath/hooks/builtin.py +113 -0
- wxpath-0.2.0/src/wxpath/hooks/registry.py +133 -0
- wxpath-0.2.0/src/wxpath/http/__init__.py +0 -0
- wxpath-0.2.0/src/wxpath/http/client/__init__.py +9 -0
- wxpath-0.2.0/src/wxpath/http/client/crawler.py +196 -0
- wxpath-0.2.0/src/wxpath/http/client/request.py +35 -0
- wxpath-0.2.0/src/wxpath/http/client/response.py +14 -0
- wxpath-0.2.0/src/wxpath/http/policy/backoff.py +16 -0
- wxpath-0.2.0/src/wxpath/http/policy/retry.py +35 -0
- wxpath-0.2.0/src/wxpath/http/policy/throttler.py +114 -0
- wxpath-0.2.0/src/wxpath/http/stats.py +96 -0
- {wxpath-0.1.0 → wxpath-0.2.0}/src/wxpath/patches.py +7 -2
- wxpath-0.2.0/src/wxpath/util/__init__.py +0 -0
- wxpath-0.2.0/src/wxpath/util/logging.py +91 -0
- wxpath-0.2.0/src/wxpath/util/serialize.py +22 -0
- {wxpath-0.1.0 → wxpath-0.2.0/src/wxpath.egg-info}/PKG-INFO +30 -97
- wxpath-0.2.0/src/wxpath.egg-info/SOURCES.txt +36 -0
- {wxpath-0.1.0 → wxpath-0.2.0}/src/wxpath.egg-info/requires.txt +5 -2
- wxpath-0.2.0/src/wxpath.egg-info/top_level.txt +1 -0
- wxpath-0.1.0/src/wxpath.egg-info/SOURCES.txt +0 -12
- wxpath-0.1.0/src/wxpath.egg-info/top_level.txt +0 -1
- {wxpath-0.1.0 → wxpath-0.2.0}/LICENSE +0 -0
- {wxpath-0.1.0 → wxpath-0.2.0}/setup.cfg +0 -0
- {wxpath-0.1.0 → wxpath-0.2.0}/src/wxpath/__init__.py +0 -0
- {wxpath-0.1.0 → wxpath-0.2.0}/src/wxpath.egg-info/dependency_links.txt +0 -0
- {wxpath-0.1.0 → wxpath-0.2.0}/src/wxpath.egg-info/entry_points.txt +0 -0
--- wxpath-0.1.0/src/wxpath.egg-info/PKG-INFO
+++ wxpath-0.2.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: wxpath
-Version: 0.1.0
+Version: 0.2.0
 Summary: wxpath - a declarative web crawler and data extractor
 Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
 License-Expression: MIT
@@ -9,11 +9,13 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: requests>=2.0
 Requires-Dist: lxml>=4.0
-Requires-Dist: elementpath
-Requires-Dist: aiohttp
+Requires-Dist: elementpath<=5.0.3,>=5.0.0
+Requires-Dist: aiohttp<=3.12.15,>=3.8.0
 Provides-Extra: test
 Requires-Dist: pytest>=7.0; extra == "test"
 Requires-Dist: pytest-asyncio>=0.23; extra == "test"
+Provides-Extra: dev
+Requires-Dist: ruff; extra == "dev"
 Dynamic: license-file
 
 
@@ -25,10 +27,11 @@ By introducing the `url(...)` operator and the `///` syntax, **wxpath**'s engine
 
 NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
 
+
 ## Contents
 
 - [Example](#example)
-- [`url(...)` and
+- [`url(...)` and `///url(...)` Explained](#url-and---explained)
 - [General flow](#general-flow)
 - [Asynchronous Crawling](#asynchronous-crawling)
 - [Output types](#output-types)
@@ -37,11 +40,13 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Hooks (Experimental)](#hooks-experimental)
 - [Install](#install)
 - [More Examples](#more-examples)
+- [Comparisons](#comparisons)
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
 - [License](#license)
 
+
 ## Example
 
 ```python
@@ -49,7 +54,7 @@ import wxpath
 
 path = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-///main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))]
+///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
 /map{
     'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
     'url':string(base-uri(.)),
@@ -84,10 +89,11 @@ The above expression does the following:
 4. Streams the extracted data as it is discovered.
 
 
-## `url(...)` and
+## `url(...)` and `///url(...)` Explained
 
 - `url(...)` is a custom operator that fetches the content of the user-specified or internally generated URL and returns it as an `lxml.html.HtmlElement` for further XPath processing.
--
+- `///url(...)` indicates infinite/recursive traversal. It tells **wxpath** to continue following links indefinitely, up to the specified `max_depth`. Unlike repeated `url()` hops, it allows a single expression to describe unbounded graph exploration. WARNING: Use with caution and constraints (via `max_depth` or XPath predicates) to avoid traversal explosion.
+
 
 ## General flow
 
@@ -97,14 +103,13 @@ The above expression does the following:
 
 XPath segments operate on fetched documents (fetched via the immediately preceding `url(...)` operations).
 
-
+`///url(...)` indicates infinite/recursive traversal - it proceeds breadth-first-*ish* up to `max_depth`.
 
 Results are yielded as soon as they are ready.
 
 
 ## Asynchronous Crawling
 
-
 **wxpath** is `asyncio/aiohttp`-first, providing an asynchronous API for crawling and extracting data.
 
 ```python
@@ -114,7 +119,7 @@ from wxpath import wxpath_async
 items = []
 
 async def main():
-    path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(
+    path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//@href[starts-with(., '/wiki/')])//a/@href"
     async for item in wxpath_async(path_expr, max_depth=1):
         items.append(item)
 
@@ -123,16 +128,16 @@ asyncio.run(main())
 
 ### Blocking, Concurrent Requests
 
-
 **wxpath** also supports concurrent requests using an asyncio-in-sync pattern, allowing you to crawl multiple pages concurrently while maintaining the simplicity of synchronous code. This is particularly useful for crawls in strictly synchronous execution environments (i.e., not inside an `asyncio` event loop) where performance is a concern.
 
 ```python
 from wxpath import wxpath_async_blocking_iter
 
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(
+path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//@href[starts-with(., '/wiki/')])//a/@href"
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1))
 ```
 
+
 ## Output types
 
 The wxpath Python API yields structured objects, not just strings.
@@ -156,7 +161,7 @@ The Python API preserves structure by default.
 ```python
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-///div[@id='mw-content-text']//a
+///url(//div[@id='mw-content-text']//a/@href)
 /map{
     'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
     'short_description':(//div[contains(@class, "shortdescription")]/text())[1],
@@ -176,15 +181,18 @@ path_expr = """
 # ...]
 ```
 
+
 ## CLI
 
 **wxpath** provides a command-line interface (CLI) to quickly experiment and execute wxpath expressions directly from the terminal.
 
+The following example demonstrates how to crawl Wikipedia starting from the "Expression language" page, extract links to other wiki pages, and retrieve specific fields from each linked page.
+
+WARNING: Due to the everchanging nature of web content, the output may vary over time.
 ```bash
 > wxpath --depth 1 "\
 url('https://en.wikipedia.org/wiki/Expression_language')\
-///div[@id='mw-content-text'] \
-//a/url(@href[starts-with(., '/wiki/') \
+///url(//div[@id='mw-content-text']//a/@href[starts-with(., '/wiki/') \
 and not(matches(@href, '^(?:/wiki/)?(?:Wikipedia|File|Template|Special|Template_talk|Help):'))]) \
 /map{ \
 'title':(//span[contains(@class, 'mw-page-title-main')]/text())[1], \
@@ -256,90 +264,13 @@ pip install wxpath
 
 ## More Examples
 
-
-import wxpath
+See [EXAMPLES.md](EXAMPLES.md) for more usage examples.
 
-#### EXAMPLE 1 - Simple, single page crawl and link extraction #######
-#
-# Starting from Expression language's wiki, extract all links (hrefs)
-# from the main section. The `url(...)` operator is used to execute a
-# web request to the specified URL and return the HTML content.
-#
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//main//a/@href"
-
-items = wxpath.wxpath_async_blocking(path_expr)
-
-
-#### EXAMPLE 2 - Two-deep crawl and link extraction ##################
-#
-# Starting from Expression language's wiki, crawl all child links
-# starting with '/wiki/', and extract each child's links (hrefs). The
-# `url(...)` operator is pipe'd arguments from the evaluated XPath.
-#
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(@href[starts-with(., '/wiki/')])//a/@href"
-
-#### EXAMPLE 3 - Infinite crawl with BFS tree depth limit ############
-#
-# Starting from Expression language's wiki, infinitely crawl all child
-# links (and child's child's links recursively). The `///` syntax is
-# used to indicate an infinite crawl.
-# Returns lxml.html.HtmlElement objects.
-#
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///main//a/url(@href)"
-
-# The same expression written differently:
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//main//a/@href)"
-
-# Modify (inclusive) max_depth to limit the BFS tree (crawl depth).
-items = wxpath.wxpath_async_blocking(path_expr, max_depth=1)
-
-#### EXAMPLE 4 - Infinite crawl with field extraction ################
-#
-# Infinitely crawls Expression language's wiki's child links and
-# childs' child links (recursively) and then, for each child link
-# crawled, extracts objects with the named fields as a dict.
-#
-path_expr = """
-url('https://en.wikipedia.org/wiki/Expression_language')
-///main//a/url(@href)
-/map {
-    'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
-    'short_description':(//div[contains(@class, "shortdescription")]/text())[1],
-    'url'://link[@rel='canonical']/@href[1],
-    'backlink':wx:backlink(.),
-    'depth':wx:depth(.)
-}
-"""
 
-
-
-
-# >> segments
-# [Segment(op='url', value='https://en.wikipedia.org/wiki/Expression_language'),
-# Segment(op='url_inf', value='///url(//main//a/@href)'),
-# Segment(op='xpath', value='/map { \'title\':(//span[contains(@class, "mw-page-title-main")]/text())[1], \'short_description\':(//div[contains(@class, "shortdescription")]/text())[1], \'url\'://link[@rel=\'canonical\']/@href[1] }')]
-
-#### EXAMPLE 5 = Seeding from XPath function expression + mapping operator (`!`)
-#
-# Functionally create 10 Amazon book search result page URLs, map each URL to
-# the url(.) operator, and for each page, extract the title, price, and link of
-# each book listed.
-#
-base_url = "https://www.amazon.com/s?k=books&i=stripbooks&page="
-
-path_expr = f"""
-(1 to 10) ! ('{base_url}' || .) !
-url(.)
-//span[@data-component-type='s-search-results']//*[@role='listitem']
-/map {{
-    'title': (.//h2/span/text())[1],
-    'price': (.//span[@class='a-price']/span[@class='a-offscreen']/text())[1],
-    'link': (.//a[@aria-describedby='price-link']/@href)[1]
-}}
-"""
+## Comparisons
+
+See [COMPARISONS.md](COMPARISONS.md) for comparisons with other web-scraping tools.
 
-items = list(wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1))
-```
 
 ## Advanced: Engine & Crawler Configuration
 
@@ -364,7 +295,7 @@ engine = WXPathEngine(
     crawler=crawler,
 )
 
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')
+path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//main//a/@href)"
 
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
@@ -392,6 +323,7 @@ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 - Automatic proxy rotation
 - Browser-based rendering (JavaScript execution)
 
+
 ## WARNINGS!!!
 
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
@@ -399,6 +331,7 @@ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
+
 ## License
 
 MIT
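For orientation, the headline change in this release is syntactic: the bare `///` traversal step becomes the explicit `///url(...)` form. A minimal sketch assembled from the README example shown in the diff above (URL, expression, and the `wxpath_async_blocking_iter` entry point are all taken verbatim from the diff; only the `print` loop is added):

```python
import wxpath

# 0.2.0 syntax: `///url(...)` makes the recursive-fetch step explicit,
# where 0.1.0 used a bare `///` XPath step.
path = """
url('https://en.wikipedia.org/wiki/Expression_language')
///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
/map{
    'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
    'url':string(base-uri(.))
}
"""

# Stream results as pages are crawled; max_depth bounds the traversal.
for item in wxpath.wxpath_async_blocking_iter(path, max_depth=1):
    print(item)
```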
--- wxpath-0.1.0/README.md
+++ wxpath-0.2.0/README.md
@@ -7,10 +7,11 @@ By introducing the `url(...)` operator and the `///` syntax, **wxpath**'s engine
 
 NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
 
+
 ## Contents
 
 - [Example](#example)
-- [`url(...)` and
+- [`url(...)` and `///url(...)` Explained](#url-and---explained)
 - [General flow](#general-flow)
 - [Asynchronous Crawling](#asynchronous-crawling)
 - [Output types](#output-types)
@@ -19,11 +20,13 @@ NOTE: This project is in early development. Core concepts are stable, but the AP
 - [Hooks (Experimental)](#hooks-experimental)
 - [Install](#install)
 - [More Examples](#more-examples)
+- [Comparisons](#comparisons)
 - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
 - [Project Philosophy](#project-philosophy)
 - [Warnings](#warnings)
 - [License](#license)
 
+
 ## Example
 
 ```python
@@ -31,7 +34,7 @@ import wxpath
 
 path = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-///main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))]
+///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
 /map{
     'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
     'url':string(base-uri(.)),
@@ -66,10 +69,11 @@ The above expression does the following:
 4. Streams the extracted data as it is discovered.
 
 
-## `url(...)` and
+## `url(...)` and `///url(...)` Explained
 
 - `url(...)` is a custom operator that fetches the content of the user-specified or internally generated URL and returns it as an `lxml.html.HtmlElement` for further XPath processing.
--
+- `///url(...)` indicates infinite/recursive traversal. It tells **wxpath** to continue following links indefinitely, up to the specified `max_depth`. Unlike repeated `url()` hops, it allows a single expression to describe unbounded graph exploration. WARNING: Use with caution and constraints (via `max_depth` or XPath predicates) to avoid traversal explosion.
+
 
 ## General flow
 
@@ -79,14 +83,13 @@ The above expression does the following:
 
 XPath segments operate on fetched documents (fetched via the immediately preceding `url(...)` operations).
 
-
+`///url(...)` indicates infinite/recursive traversal - it proceeds breadth-first-*ish* up to `max_depth`.
 
 Results are yielded as soon as they are ready.
 
 
 ## Asynchronous Crawling
 
-
 **wxpath** is `asyncio/aiohttp`-first, providing an asynchronous API for crawling and extracting data.
 
 ```python
@@ -96,7 +99,7 @@ from wxpath import wxpath_async
 items = []
 
 async def main():
-    path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(
+    path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//@href[starts-with(., '/wiki/')])//a/@href"
     async for item in wxpath_async(path_expr, max_depth=1):
         items.append(item)
 
@@ -105,16 +108,16 @@ asyncio.run(main())
 
 ### Blocking, Concurrent Requests
 
-
 **wxpath** also supports concurrent requests using an asyncio-in-sync pattern, allowing you to crawl multiple pages concurrently while maintaining the simplicity of synchronous code. This is particularly useful for crawls in strictly synchronous execution environments (i.e., not inside an `asyncio` event loop) where performance is a concern.
 
 ```python
 from wxpath import wxpath_async_blocking_iter
 
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(
+path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//@href[starts-with(., '/wiki/')])//a/@href"
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1))
 ```
 
+
 ## Output types
 
 The wxpath Python API yields structured objects, not just strings.
@@ -138,7 +141,7 @@ The Python API preserves structure by default.
 ```python
 path_expr = """
 url('https://en.wikipedia.org/wiki/Expression_language')
-///div[@id='mw-content-text']//a
+///url(//div[@id='mw-content-text']//a/@href)
 /map{
     'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
     'short_description':(//div[contains(@class, "shortdescription")]/text())[1],
@@ -158,15 +161,18 @@ path_expr = """
 # ...]
 ```
 
+
 ## CLI
 
 **wxpath** provides a command-line interface (CLI) to quickly experiment and execute wxpath expressions directly from the terminal.
 
+The following example demonstrates how to crawl Wikipedia starting from the "Expression language" page, extract links to other wiki pages, and retrieve specific fields from each linked page.
+
+WARNING: Due to the everchanging nature of web content, the output may vary over time.
 ```bash
 > wxpath --depth 1 "\
 url('https://en.wikipedia.org/wiki/Expression_language')\
-///div[@id='mw-content-text'] \
-//a/url(@href[starts-with(., '/wiki/') \
+///url(//div[@id='mw-content-text']//a/@href[starts-with(., '/wiki/') \
 and not(matches(@href, '^(?:/wiki/)?(?:Wikipedia|File|Template|Special|Template_talk|Help):'))]) \
 /map{ \
 'title':(//span[contains(@class, 'mw-page-title-main')]/text())[1], \
@@ -238,90 +244,13 @@ pip install wxpath
 
 ## More Examples
 
-
-import wxpath
+See [EXAMPLES.md](EXAMPLES.md) for more usage examples.
 
-#### EXAMPLE 1 - Simple, single page crawl and link extraction #######
-#
-# Starting from Expression language's wiki, extract all links (hrefs)
-# from the main section. The `url(...)` operator is used to execute a
-# web request to the specified URL and return the HTML content.
-#
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//main//a/@href"
-
-items = wxpath.wxpath_async_blocking(path_expr)
-
-
-#### EXAMPLE 2 - Two-deep crawl and link extraction ##################
-#
-# Starting from Expression language's wiki, crawl all child links
-# starting with '/wiki/', and extract each child's links (hrefs). The
-# `url(...)` operator is pipe'd arguments from the evaluated XPath.
-#
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(@href[starts-with(., '/wiki/')])//a/@href"
-
-#### EXAMPLE 3 - Infinite crawl with BFS tree depth limit ############
-#
-# Starting from Expression language's wiki, infinitely crawl all child
-# links (and child's child's links recursively). The `///` syntax is
-# used to indicate an infinite crawl.
-# Returns lxml.html.HtmlElement objects.
-#
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///main//a/url(@href)"
-
-# The same expression written differently:
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//main//a/@href)"
-
-# Modify (inclusive) max_depth to limit the BFS tree (crawl depth).
-items = wxpath.wxpath_async_blocking(path_expr, max_depth=1)
-
-#### EXAMPLE 4 - Infinite crawl with field extraction ################
-#
-# Infinitely crawls Expression language's wiki's child links and
-# childs' child links (recursively) and then, for each child link
-# crawled, extracts objects with the named fields as a dict.
-#
-path_expr = """
-url('https://en.wikipedia.org/wiki/Expression_language')
-///main//a/url(@href)
-/map {
-    'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
-    'short_description':(//div[contains(@class, "shortdescription")]/text())[1],
-    'url'://link[@rel='canonical']/@href[1],
-    'backlink':wx:backlink(.),
-    'depth':wx:depth(.)
-}
-"""
 
-
-
-
-# >> segments
-# [Segment(op='url', value='https://en.wikipedia.org/wiki/Expression_language'),
-# Segment(op='url_inf', value='///url(//main//a/@href)'),
-# Segment(op='xpath', value='/map { \'title\':(//span[contains(@class, "mw-page-title-main")]/text())[1], \'short_description\':(//div[contains(@class, "shortdescription")]/text())[1], \'url\'://link[@rel=\'canonical\']/@href[1] }')]
-
-#### EXAMPLE 5 = Seeding from XPath function expression + mapping operator (`!`)
-#
-# Functionally create 10 Amazon book search result page URLs, map each URL to
-# the url(.) operator, and for each page, extract the title, price, and link of
-# each book listed.
-#
-base_url = "https://www.amazon.com/s?k=books&i=stripbooks&page="
-
-path_expr = f"""
-(1 to 10) ! ('{base_url}' || .) !
-url(.)
-//span[@data-component-type='s-search-results']//*[@role='listitem']
-/map {{
-    'title': (.//h2/span/text())[1],
-    'price': (.//span[@class='a-price']/span[@class='a-offscreen']/text())[1],
-    'link': (.//a[@aria-describedby='price-link']/@href)[1]
-}}
-"""
+## Comparisons
+
+See [COMPARISONS.md](COMPARISONS.md) for comparisons with other web-scraping tools.
 
-items = list(wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1))
-```
 
 ## Advanced: Engine & Crawler Configuration
 
@@ -346,7 +275,7 @@ engine = WXPathEngine(
     crawler=crawler,
 )
 
-path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')
+path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(//main//a/@href)"
 
 items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 ```
@@ -374,6 +303,7 @@ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 - Automatic proxy rotation
 - Browser-based rendering (JavaScript execution)
 
+
 ## WARNINGS!!!
 
 - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
@@ -381,6 +311,7 @@ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
 - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
 - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
 
+
 ## License
 
 MIT
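The async-first entry point shown in the README reads, in full, as follows. This is a sketch restating the README's own async example from the diff above; nothing here goes beyond what the diff shows except collecting results with `print`:

```python
import asyncio
from wxpath import wxpath_async

async def main():
    # New-form expression, verbatim from the README's async example.
    path_expr = ("url('https://en.wikipedia.org/wiki/Expression_language')"
                 "///url(//@href[starts-with(., '/wiki/')])//a/@href")
    # Items are yielded as soon as each page is fetched and evaluated.
    async for item in wxpath_async(path_expr, max_depth=1):
        print(item)

asyncio.run(main())
```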
--- wxpath-0.1.0/pyproject.toml
+++ wxpath-0.2.0/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "wxpath"
-version = "0.1.0"
+version = "0.2.0"
 description = "wxpath - a declarative web crawler and data extractor"
 readme = "README.md"
 requires-python = ">=3.9"
@@ -16,12 +16,13 @@ license-files = ["LICENSE"]
 dependencies = [
     "requests>=2.0",
     "lxml>=4.0",
-    "elementpath>=5.0.0",
-    "aiohttp>=3.8.0"
+    "elementpath>=5.0.0,<=5.0.3",
+    "aiohttp>=3.8.0,<=3.12.15"
 ]
 
 [project.optional-dependencies]
 test = ["pytest>=7.0", "pytest-asyncio>=0.23"]
+dev = ["ruff"]
 
 [project.scripts]
 wxpath = "wxpath.cli:main"
@@ -32,7 +33,24 @@ addopts = "-ra -q"
 testpaths = ["tests"]
 
 [tool.setuptools]
-package-dir = {"" = "src"}
 
 [tool.setuptools.packages.find]
-
+where = ["src"]
+include = ["wxpath", "wxpath.*"]
+
+[tool.ruff]
+target-version = "py311"
+line-length = 100
+
+lint.select = [
+    "F",      # pyflakes (unused vars, undefined names, etc.)
+    "E",      # pycodestyle errors
+    "B",      # flake8-bugbear (real footguns)
+    "ASYNC",  # async/await correctness
+    "I",      # isort rules
+    "TID",    # Tidy imports
+    "ICN",    # Import conventions
+]
+
+[tool.ruff.format]
+quote-style = "single"
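The new upper bounds on `elementpath` and `aiohttp` can be sanity-checked at runtime. A small sketch using only the standard library; the package names and bounds are copied from the diff, and `version()` raises `PackageNotFoundError` if a package is absent:

```python
# Report installed versions against the ranges pinned in 0.2.0.
from importlib.metadata import version

PINS = {
    "elementpath": (">=5.0.0", "<=5.0.3"),
    "aiohttp": (">=3.8.0", "<=3.12.15"),
}

for pkg, (lo, hi) in PINS.items():
    print(f"{pkg}=={version(pkg)} (declared {lo},{hi})")
```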
--- wxpath-0.1.0/src/wxpath/cli.py
+++ wxpath-0.2.0/src/wxpath/cli.py
@@ -2,34 +2,14 @@ import argparse
 import json
 import sys
 
-from wxpath.hooks import builtin  # load default hooks
-from wxpath.core.ops import WxStr
 from wxpath.core.parser import parse_wxpath_expr
-from wxpath.core.runtime.engine import
-
-
-def _simplify(obj):
-    """
-    Recursively convert custom wrapper types (e.g., WxStr / ExtractedStr,
-    lxml elements) into plain built-in Python types so that printing or
-    JSON serialising shows clean values.
-    """
-    # Scalars
-    if isinstance(obj, WxStr):
-        return str(obj)
-
-    # Mapping
-    if isinstance(obj, dict):
-        return {k: _simplify(v) for k, v in obj.items()}
-
-    # Sequence (but not str/bytes)
-    if isinstance(obj, (list, tuple, set)):
-        return type(obj)(_simplify(v) for v in obj)
-
-    return obj
+from wxpath.core.runtime.engine import WXPathEngine, wxpath_async_blocking_iter
+from wxpath.hooks import builtin, registry
+from wxpath.util.serialize import simplify
 
 
 def main():
+    registry.register(builtin.SerializeXPathMapAndNodeHook)
     parser = argparse.ArgumentParser(description="Run wxpath expression.")
     parser.add_argument("expression", help="The wxpath expression")
     parser.add_argument("--depth", type=int, default=1, help="Recursion depth")
@@ -39,7 +19,12 @@ def main():
     parser.add_argument("--verbose", action="store_true", help="Verbose mode")
 
     parser.add_argument("--concurrency", type=int, default=16, help="Number of concurrent fetches")
-    parser.add_argument(
+    parser.add_argument(
+        "--concurrency-per-host",
+        type=int,
+        default=8,
+        help="Number of concurrent fetches per host"
+    )
 
     args = parser.parse_args()
 
@@ -48,8 +33,8 @@ def main():
         print("parsed expression:", parse_wxpath_expr(args.expression))
 
     if args.debug:
-        from wxpath import configure_logging
-        configure_logging(
+        from wxpath import configure_logging
+        configure_logging('DEBUG')
 
     engine = WXPathEngine(
         concurrency=args.concurrency,
@@ -57,7 +42,7 @@ def main():
     )
     try:
         for r in wxpath_async_blocking_iter(args.expression, args.depth, engine):
-            clean = _simplify(r)
+            clean = simplify(r)
             print(json.dumps(clean, ensure_ascii=False), flush=True)
     except BrokenPipeError:
         sys.exit(0)
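The inlined `_simplify` helper has been moved out of the CLI; judging from the new import, it now lives in `wxpath.util.serialize` as `simplify`. A sketch of what that function presumably looks like, mirroring the code removed above. This is an assumption: the actual implementation in `src/wxpath/util/serialize.py` (+22 lines, not shown in this diff) may differ:

```python
# Presumed shape of wxpath.util.serialize.simplify, reconstructed from
# the _simplify helper deleted from cli.py above (an assumption, not
# the verified module contents).
from wxpath.core.ops import WxStr

def simplify(obj):
    """Recursively convert wrapper types into plain built-in Python types."""
    if isinstance(obj, WxStr):                # scalar wrappers -> str
        return str(obj)
    if isinstance(obj, dict):                 # mappings
        return {k: simplify(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):   # sequences (not str/bytes)
        return type(obj)(simplify(v) for v in obj)
    return obj
```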
--- /dev/null
+++ wxpath-0.2.0/src/wxpath/core/dom.py
@@ -0,0 +1,22 @@
+from urllib.parse import urljoin
+
+
+def _make_links_absolute(links: list[str], base_url: str) -> list[str]:
+    """
+    Convert relative links to absolute links based on the base URL.
+
+    Args:
+        links (list): List of link strings.
+        base_url (str): The base URL to resolve relative links against.
+
+    Returns:
+        List of absolute URLs.
+    """
+    if base_url is None:
+        raise ValueError("base_url must not be None when making links absolute.")
+    return [urljoin(base_url, link) for link in links if link]
+
+
+def get_absolute_links_from_elem_and_xpath(elem, xpath):
+    base_url = getattr(elem, 'base_url', None)
+    return _make_links_absolute(elem.xpath3(xpath), base_url)