wxpath 0.4.1.tar.gz → 0.5.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {wxpath-0.4.1 → wxpath-0.5.0}/PKG-INFO +71 -8
- wxpath-0.4.1/src/wxpath.egg-info/PKG-INFO → wxpath-0.5.0/README.md +46 -34
- {wxpath-0.4.1 → wxpath-0.5.0}/pyproject.toml +18 -1
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/__init__.py +2 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/cli.py +6 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/models.py +1 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/ops.py +9 -12
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/parser.py +92 -23
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/runtime/engine.py +36 -3
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/runtime/helpers.py +6 -3
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/client/__init__.py +1 -1
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/client/crawler.py +17 -5
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/client/response.py +7 -1
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/policy/retry.py +2 -2
- wxpath-0.5.0/src/wxpath/integrations/langchain/__init__.py +0 -0
- wxpath-0.5.0/src/wxpath/integrations/langchain/examples/basic_rag.py +85 -0
- wxpath-0.5.0/src/wxpath/integrations/langchain/examples/rolling_window_rag.py +218 -0
- wxpath-0.5.0/src/wxpath/integrations/langchain/loader.py +60 -0
- wxpath-0.5.0/src/wxpath/patches.py +273 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/settings.py +3 -1
- wxpath-0.5.0/src/wxpath/tui.py +1204 -0
- wxpath-0.5.0/src/wxpath/tui_settings.py +151 -0
- wxpath-0.5.0/src/wxpath/util/__init__.py +0 -0
- wxpath-0.5.0/src/wxpath/util/cleaners.py +31 -0
- wxpath-0.5.0/src/wxpath/util/common_paths.py +22 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/util/logging.py +3 -7
- wxpath-0.4.1/README.md → wxpath-0.5.0/src/wxpath.egg-info/PKG-INFO +96 -7
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath.egg-info/SOURCES.txt +9 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath.egg-info/entry_points.txt +1 -0
- wxpath-0.5.0/src/wxpath.egg-info/requires.txt +43 -0
- wxpath-0.4.1/src/wxpath/patches.py +0 -63
- wxpath-0.4.1/src/wxpath.egg-info/requires.txt +0 -20
- {wxpath-0.4.1 → wxpath-0.5.0}/LICENSE +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/setup.cfg +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/__init__.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/dom.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/core/runtime/__init__.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/hooks/__init__.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/hooks/builtin.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/hooks/registry.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/__init__.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/client/cache.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/client/request.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/policy/backoff.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/policy/robots.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/policy/throttler.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/http/stats.py +0 -0
- {wxpath-0.4.1/src/wxpath/util → wxpath-0.5.0/src/wxpath/integrations}/__init__.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath/util/serialize.py +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath.egg-info/dependency_links.txt +0 -0
- {wxpath-0.4.1 → wxpath-0.5.0}/src/wxpath.egg-info/top_level.txt +0 -0
````diff
--- wxpath-0.4.1/PKG-INFO
+++ wxpath-0.5.0/PKG-INFO
@@ -1,9 +1,14 @@
 Metadata-Version: 2.4
 Name: wxpath
-Version: 0.4.1
+Version: 0.5.0
 Summary: wxpath - a declarative web crawler and data extractor
 Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
 License-Expression: MIT
+Project-URL: Homepage, https://rodricios.github.io/wxpath
+Project-URL: Documentation, https://rodricios.github.io/wxpath
+Project-URL: Repository, https://github.com/rodricios/wxpath
+Project-URL: Issues, https://github.com/rodricios/wxpath/issues
+Project-URL: Changelog, https://github.com/rodricios/wxpath/blob/main/CHANGELOG.md
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
@@ -17,16 +22,55 @@ Provides-Extra: cache-sqlite
 Requires-Dist: aiohttp-client-cache[sqlite]; extra == "cache-sqlite"
 Provides-Extra: cache-redis
 Requires-Dist: aiohttp-client-cache[redis]; extra == "cache-redis"
+Provides-Extra: llm
+Requires-Dist: langchain>=1.0.0; extra == "llm"
+Requires-Dist: langchain-core>=1.0.0; extra == "llm"
+Requires-Dist: langchain-ollama>=1.0.0; extra == "llm"
+Requires-Dist: langchain-community>=0.4.0; extra == "llm"
+Requires-Dist: langchain-chroma>=1.0.0; extra == "llm"
+Requires-Dist: chromadb>=1.0.0; extra == "llm"
+Requires-Dist: langchain-text-splitters>=1.1.0; extra == "llm"
 Provides-Extra: test
 Requires-Dist: pytest>=7.0; extra == "test"
 Requires-Dist: pytest-asyncio>=0.23; extra == "test"
 Provides-Extra: dev
 Requires-Dist: ruff; extra == "dev"
+Provides-Extra: docs
+Requires-Dist: mkdocs>=1.5; extra == "docs"
+Requires-Dist: mkdocs-material>=9.0; extra == "docs"
+Requires-Dist: mkdocstrings[python]>=0.24; extra == "docs"
+Requires-Dist: mkdocs-macros-plugin>=1.0; extra == "docs"
+Requires-Dist: mkdocs-resize-images>=1.0; extra == "docs"
+Requires-Dist: mkdocs-glightbox; extra == "docs"
+Requires-Dist: pyyaml>=6.0; extra == "docs"
+Provides-Extra: tui
+Requires-Dist: textual>=1.0.0; extra == "tui"
+Requires-Dist: aiohttp-client-cache>=0.14.0; extra == "tui"
+Requires-Dist: aiohttp-client-cache[sqlite]; extra == "tui"
 Dynamic: license-file
 
-# **wxpath** - declarative web crawling with XPath
+# **wxpath** - declarative web graph traversal with XPath
 
-[](https://www.python.org/downloads/release/python-3100/)
+[](https://www.python.org/downloads/release/python-3100/) [](https://rodricios.github.io/wxpath)
+
+
+> NEW: [TUI](https://rodricios.github.io/wxpath/tui/quickstart.md) - Interactive terminal interface (powered by Textual) for testing wxpath expressions and exporting data.
+
+
+
+## Install
+
+Requires Python 3.10+.
+
+```
+pip install wxpath
+# For TUI support
+pip install wxpath[tui]
+```
+---
+
+
+## What is wxpath?
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
@@ -35,14 +79,14 @@ This expression fetches a page, extracts links, and streams them concurrently -
 ```python
 import wxpath
 
-expr = "url('https://
+expr = "url('https://quotes.toscrape.com')//a/@href"
 
 for link in wxpath.wxpath_async_blocking_iter(expr):
     print(link)
 ```
 
 
-By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform recursive (or paginated) web crawling and extraction:
 
 ```python
 import wxpath
@@ -62,15 +106,28 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
 
 Most web scrapers force you to write crawl control flow first, and extraction second.
 
-**wxpath**
+**wxpath** converges those two steps into one:
 - **You describe traversal declaratively**
 - **Extraction is expressed inline**
 - **The engine handles scheduling, concurrency, and deduplication**
 
 
+### RAG-Ready Output
+
+Extract clean, structured JSON hierarchies directly from the graph - feed your LLMs signal, not noise. Refer to [LangChain Integration](https://rodricios.github.io/wxpath/api/integrations/langchain/) for more details.
+
+
+### Deterministic
+
+**wxpath** is deterministic (read: not powered by LLMs). While we can't guarantee the network is stable, we can guarantee the traversal is.
+
+## Documentation (WIP)
+
+Documentation is now available [here](https://rodricios.github.io/wxpath/).
+
 ## Contents
 
-- [Example](#example)
+- [Example: Knowledge Graph](#example)
 - [Language Design](DESIGN.md)
 - [`url(...)` and `///url(...)` Explained](#url-and-url-explained)
 - [General flow](#general-flow)
@@ -80,6 +137,7 @@ Most web scrapers force you to write crawl control flow first, and extraction se
 - [XPath 3.1](#xpath-31-by-default)
 - [Progress Bar](#progress-bar)
 - [CLI](#cli)
+- [TUI](#tui)
 - [Persistence and Caching](#persistence-and-caching)
 - [Settings](#settings)
 - [Hooks (Experimental)](#hooks-experimental)
@@ -294,12 +352,17 @@ Command line options:
   --cache [true|false] (Default: False) Persist crawl results to a local database
 ```
 
+## TUI
+
+**wxpath** provides a terminal interface (TUI) for interactive expression testing and data extraction.
+
+See [TUI Quickstart](https://rodricios.github.io/wxpath/tui/quickstart.md) for more details.
 
 ## Persistence and Caching
 
 **wxpath** optionally persists crawl results to a local database. This is especially useful when you're crawling a large number of URLs, and you decide to pause the crawl, change extraction expressions, or otherwise need to restart the crawl.
 
-**wxpath** supports two backends: sqlite and redis. SQLite is great for small-scale crawls, with a single worker (i.e., `engine.crawler.concurrency == 1`). Redis is great for large-scale crawls, with multiple workers. You will
+**wxpath** supports two backends: sqlite and redis. SQLite is great for small-scale crawls, with a single worker (i.e., `engine.crawler.concurrency == 1`). Redis is great for large-scale crawls, with multiple workers. You will encounter a warning if `min(engine.crawler.concurrency, engine.crawler.per_host) > 1` when using the sqlite backend.
 
 To use, you must install the appropriate optional dependency:
````
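As a concrete illustration of the `follow=` syntax and streaming iteration described in the README above, here is a minimal sketch; the pagination and quote XPaths are illustrative assumptions, not taken from the diff:

```python
import wxpath

# Sketch based on the README snippets in the diff above: follow= names the
# links to enqueue, and max_depth bounds the recursion, as in the README's
# `wxpath_async_blocking_iter(path_expr, max_depth=1)` example.
# The pagination and extraction XPaths below are illustrative assumptions.
path_expr = (
    "url('https://quotes.toscrape.com', follow=//li[@class='next']/a/@href)"
    "//div[@class='quote']/span[@class='text']/text()"
)

for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)
```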
````diff
--- wxpath-0.4.1/src/wxpath.egg-info/PKG-INFO
+++ wxpath-0.5.0/README.md
@@ -1,32 +1,25 @@
-Metadata-Version: 2.4
-Name: wxpath
-Version: 0.4.1
-Summary: wxpath - a declarative web crawler and data extractor
-Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
-License-Expression: MIT
-Requires-Python: >=3.10
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: lxml>=4.0
-Requires-Dist: elementpath>=5.0.0,<=5.0.3
-Requires
-Requires-Dist: tqdm>=4.0.0
-Provides-Extra: cache
-Requires-Dist: aiohttp-client-cache>=0.14.0; extra == "cache"
-Provides-Extra: cache-sqlite
-Requires-Dist: aiohttp-client-cache[sqlite]; extra == "cache-sqlite"
-Provides-Extra: cache-redis
-Requires-Dist: aiohttp-client-cache[redis]; extra == "cache-redis"
-Provides-Extra: test
-Requires-Dist: pytest>=7.0; extra == "test"
-Requires-Dist: pytest-asyncio>=0.23; extra == "test"
-Provides-Extra: dev
-Requires-Dist: ruff; extra == "dev"
-Dynamic: license-file
-
-# **wxpath** - declarative web crawling with XPath
-
-[](https://www.python.org/downloads/release/python-3100/)
+# **wxpath** - declarative web graph traversal with XPath
+
+[](https://www.python.org/downloads/release/python-3100/) [](https://rodricios.github.io/wxpath)
+
+
+> NEW: [TUI](https://rodricios.github.io/wxpath/tui/quickstart.md) - Interactive terminal interface (powered by Textual) for testing wxpath expressions and exporting data.
+
+
+
+## Install
+
+Requires Python 3.10+.
+
+```
+pip install wxpath
+# For TUI support
+pip install wxpath[tui]
+```
+---
+
+
+## What is wxpath?
 
 **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. **wxpath** executes that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
 
@@ -35,14 +28,14 @@ This expression fetches a page, extracts links, and streams them concurrently -
 ```python
 import wxpath
 
-expr = "url('https://
+expr = "url('https://quotes.toscrape.com')//a/@href"
 
 for link in wxpath.wxpath_async_blocking_iter(expr):
     print(link)
 ```
 
 
-By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform
+By introducing the `url(...)` operator and the `///` syntax, wxpath's engine is able to perform recursive (or paginated) web crawling and extraction:
 
 ```python
 import wxpath
@@ -62,15 +55,28 @@ for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
 
 Most web scrapers force you to write crawl control flow first, and extraction second.
 
-**wxpath**
+**wxpath** converges those two steps into one:
 - **You describe traversal declaratively**
 - **Extraction is expressed inline**
 - **The engine handles scheduling, concurrency, and deduplication**
 
 
+### RAG-Ready Output
+
+Extract clean, structured JSON hierarchies directly from the graph - feed your LLMs signal, not noise. Refer to [LangChain Integration](https://rodricios.github.io/wxpath/api/integrations/langchain/) for more details.
+
+
+### Deterministic
+
+**wxpath** is deterministic (read: not powered by LLMs). While we can't guarantee the network is stable, we can guarantee the traversal is.
+
+## Documentation (WIP)
+
+Documentation is now available [here](https://rodricios.github.io/wxpath/).
+
 ## Contents
 
-- [Example](#example)
+- [Example: Knowledge Graph](#example)
 - [Language Design](DESIGN.md)
 - [`url(...)` and `///url(...)` Explained](#url-and-url-explained)
 - [General flow](#general-flow)
@@ -80,6 +86,7 @@ Most web scrapers force you to write crawl control flow first, and extraction se
 - [XPath 3.1](#xpath-31-by-default)
 - [Progress Bar](#progress-bar)
 - [CLI](#cli)
+- [TUI](#tui)
 - [Persistence and Caching](#persistence-and-caching)
 - [Settings](#settings)
 - [Hooks (Experimental)](#hooks-experimental)
@@ -294,12 +301,17 @@ Command line options:
   --cache [true|false] (Default: False) Persist crawl results to a local database
 ```
 
+## TUI
+
+**wxpath** provides a terminal interface (TUI) for interactive expression testing and data extraction.
+
+See [TUI Quickstart](https://rodricios.github.io/wxpath/tui/quickstart.md) for more details.
 
 ## Persistence and Caching
 
 **wxpath** optionally persists crawl results to a local database. This is especially useful when you're crawling a large number of URLs, and you decide to pause the crawl, change extraction expressions, or otherwise need to restart the crawl.
 
-**wxpath** supports two backends: sqlite and redis. SQLite is great for small-scale crawls, with a single worker (i.e., `engine.crawler.concurrency == 1`). Redis is great for large-scale crawls, with multiple workers. You will
+**wxpath** supports two backends: sqlite and redis. SQLite is great for small-scale crawls, with a single worker (i.e., `engine.crawler.concurrency == 1`). Redis is great for large-scale crawls, with multiple workers. You will encounter a warning if `min(engine.crawler.concurrency, engine.crawler.per_host) > 1` when using the sqlite backend.
 
 To use, you must install the appropriate optional dependency:
````
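The README above also advertises RAG-ready output via the new LangChain integration (`src/wxpath/integrations/langchain/loader.py` in the file list). That loader's API is not shown in this diff, so the glue below is hypothetical, using only the public `langchain_core` Document type that the `llm` extra pulls in:

```python
import wxpath
from langchain_core.documents import Document  # provided via the new llm extra

# Hypothetical glue, not the package's loader.py (its API is not shown in this
# diff): wrap streamed wxpath results as LangChain Documents so a downstream
# RAG pipeline can chunk and embed them.
expr = "url('https://quotes.toscrape.com')//div[@class='quote']/span[@class='text']/text()"

docs = [
    Document(page_content=str(item), metadata={"source": "quotes.toscrape.com"})
    for item in wxpath.wxpath_async_blocking_iter(expr)
]
print(f"loaded {len(docs)} documents")
```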
```diff
--- wxpath-0.4.1/pyproject.toml
+++ wxpath-0.5.0/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "wxpath"
-version = "0.4.1"
+version = "0.5.0"
 description = "wxpath - a declarative web crawler and data extractor"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -13,6 +13,7 @@ authors = [
 ]
 license = "MIT"
 license-files = ["LICENSE"]
+
 dependencies = [
     "lxml>=4.0",
     "elementpath>=5.0.0,<=5.0.3",
@@ -20,16 +21,32 @@ dependencies = [
     "tqdm>=4.0.0"
 ]
 
+[project.urls]
+Homepage = "https://rodricios.github.io/wxpath"
+Documentation = "https://rodricios.github.io/wxpath"
+Repository = "https://github.com/rodricios/wxpath"
+Issues = "https://github.com/rodricios/wxpath/issues"
+Changelog = "https://github.com/rodricios/wxpath/blob/main/CHANGELOG.md"
+
+
 [project.optional-dependencies]
 cache = ["aiohttp-client-cache>=0.14.0"]
 cache-sqlite = ["aiohttp-client-cache[sqlite]"]
 cache-redis = ["aiohttp-client-cache[redis]"]
 
+# langchain langchain-ollama langchain-chroma chromadb
+llm = ["langchain>=1.0.0", "langchain-core>=1.0.0", "langchain-ollama>=1.0.0",
+       "langchain-community>=0.4.0", "langchain-chroma>=1.0.0", "chromadb>=1.0.0",
+       "langchain-text-splitters>=1.1.0"]
+
 test = ["pytest>=7.0", "pytest-asyncio>=0.23"]
 dev = ["ruff"]
+docs = ["mkdocs>=1.5", "mkdocs-material>=9.0", "mkdocstrings[python]>=0.24", "mkdocs-macros-plugin>=1.0", "mkdocs-resize-images>=1.0", "mkdocs-glightbox", "pyyaml>=6.0"]
+tui = ["textual>=1.0.0", "aiohttp-client-cache>=0.14.0", "aiohttp-client-cache[sqlite]"]
 
 [project.scripts]
 wxpath = "wxpath.cli:main"
+wxpath-tui = "wxpath.tui:main"
 
 [tool.pytest.ini_options]
 minversion = "6.0"
```
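The new `[project.scripts]` entry wires a second console command to the TUI: `wxpath-tui` resolves to `wxpath.tui:main`, so with the `tui` extra installed the command is equivalent to this sketch:

```python
# Equivalent of invoking `wxpath-tui` from a shell, per the [project.scripts]
# entry above; assumes the tui extra is installed (pip install "wxpath[tui]").
from wxpath.tui import main

main()
```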
```diff
--- wxpath-0.4.1/src/wxpath/__init__.py
+++ wxpath-0.5.0/src/wxpath/__init__.py
@@ -1,3 +1,4 @@
+from . import settings
 from .core.runtime.engine import wxpath_async, wxpath_async_blocking, wxpath_async_blocking_iter
 from .util.logging import configure_logging
 
@@ -6,4 +7,5 @@ __all__ = [
     'wxpath_async_blocking',
     'wxpath_async_blocking_iter',
     'configure_logging',
+    'settings',
 ]
```
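A small but user-visible effect of the `__init__.py` change: after a bare `import wxpath`, the `settings` module is guaranteed to be present as an attribute of the package:

```python
import wxpath

# The explicit re-export above guarantees the settings module is populated on
# the package object after a bare `import wxpath`:
print(wxpath.settings)
```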
```diff
--- wxpath-0.4.1/src/wxpath/cli.py
+++ wxpath-0.5.0/src/wxpath/cli.py
@@ -47,6 +47,11 @@ def main():
         help="Respect robots.txt",
         default=True
     )
+    arg_parser.add_argument(
+        "--insecure",
+        action="store_true",
+        help="Disable SSL certificate verification (use for sites with broken chains)",
+    )
     arg_parser.add_argument(
         "--cache",
         action="store_true",
@@ -112,6 +117,7 @@ def main():
         concurrency=args.concurrency,
         per_host=args.concurrency_per_host,
         respect_robots=args.respect_robots,
+        verify_ssl=not args.insecure,
         headers=custom_headers
     )
     engine = WXPathEngine(crawler=crawler)
```
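For scripted use, the new flag maps onto a `verify_ssl` keyword at crawler construction time. A sketch follows; only `verify_ssl=not args.insecure` and `WXPathEngine(crawler=crawler)` are confirmed by the diff, while the `Crawler` class name and both import paths are assumptions inferred from the file list:

```python
# Programmatic counterpart of the new --insecure flag, sketched from the
# cli.py diff above. The class name `Crawler` and both import paths are
# assumptions, not confirmed by this diff.
from wxpath.http.client.crawler import Crawler
from wxpath.core.runtime.engine import WXPathEngine

crawler = Crawler(
    respect_robots=True,
    verify_ssl=False,  # what --insecure toggles: skip certificate verification
)
engine = WXPathEngine(crawler=crawler)
```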
```diff
--- wxpath-0.4.1/src/wxpath/core/ops.py
+++ wxpath-0.5.0/src/wxpath/core/ops.py
@@ -19,6 +19,7 @@ from wxpath.core.parser import (
     Binary,
     Call,
     ContextItem,
+    Depth,
     Segment,
     Segments,
     String,
@@ -78,7 +79,10 @@ def get_operator(
 
 
 @register('url', (String,))
+@register('url', (String, Depth))
 @register('url', (String, Xpath))
+@register('url', (String, Depth, Xpath))
+@register('url', (String, Xpath, Depth))
 def _handle_url_str_lit(curr_elem: html.HtmlElement,
                         curr_segments: list[Url | Xpath],
                         curr_depth: int, **kwargs) -> Iterable[Intent]:
@@ -87,9 +91,12 @@ def _handle_url_str_lit(curr_elem: html.HtmlElement,
 
     next_segments = curr_segments[1:]
 
-
+    # NOTE: Expects parser to produce UrlCrawl node in expressions
+    # that look like `url('...', follow=//a/@href)`
+    if isinstance(url_call, UrlCrawl):
+        xpath_arg = [arg for arg in url_call.args if isinstance(arg, Xpath)][0]
         _segments = [
-            UrlCrawl('///url', [
+            UrlCrawl('///url', [xpath_arg, url_call.args[0].value])
         ] + next_segments
 
     yield CrawlIntent(url=url_call.args[0].value, next_segments=_segments)
@@ -112,16 +119,6 @@ def _handle_xpath(curr_elem: html.HtmlElement,
         raise ValueError("Element must be provided when path_expr does not start with 'url()'.")
     base_url = getattr(curr_elem, 'base_url', None)
     log.debug("base url", extra={"depth": curr_depth, "op": 'xpath', "base_url": base_url})
-
-    _backlink_str = f"string('{curr_elem.get('backlink')}')"
-    # We use the root tree's depth and not curr_depth because curr_depth accounts for a +1
-    # increment after each url*() hop
-    _depth_str = f"number({curr_elem.getroottree().getroot().get('depth')})"
-    expr = expr.replace('wx:backlink()', _backlink_str)
-    expr = expr.replace('wx:backlink(.)', _backlink_str)
-    expr = expr.replace('wx:depth()', _depth_str)
-    expr = expr.replace('wx:depth(.)', _depth_str)
-
     elems = curr_elem.xpath3(expr)
 
     next_segments = curr_segments[1:]
```
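The added `@register` stacks extend url() operator dispatch to the argument shapes the parser can now emit. Spelled out as a sketch (expression forms are from the parser docstring later in this diff; registering both keyword orders lets either `follow=..., depth=...` or `depth=..., follow=...` resolve to the same handler):

```python
# url() argument shapes now dispatchable to _handle_url_str_lit, with the
# expression forms that produce them (forms taken from the parser docstring
# later in this diff; shapes mirror the @register calls above):
#
#   url('...')                            -> (String,)
#   url('...', depth=2)                   -> (String, Depth)
#   url('...', follow=//a/@href)          -> (String, Xpath)
#   url('...', follow=//a/@href, depth=2) -> (String, Xpath, Depth)
#   url('...', depth=2, follow=//a/@href) -> (String, Depth, Xpath)
```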
```diff
--- wxpath-0.4.1/src/wxpath/core/parser.py
+++ wxpath-0.5.0/src/wxpath/core/parser.py
@@ -13,7 +13,8 @@ except ImportError:
 
 
 TOKEN_SPEC = [
-    ("NUMBER", r"\d
+    ("NUMBER", r"\d+\.\d+"),
+    ("INTEGER", r"\d+"),
     ("STRING", r"'([^'\\]|\\.)*'|\"([^\"\\]|\\.)*\""), # TODO: Rename to URL Literal
     ("WXPATH", r"/{0,3}\s*url"), # Must come before NAME to match 'url' as WXPATH
     # ("///URL", r"/{3}\s*url"),
@@ -22,6 +23,7 @@ TOKEN_SPEC = [
     ("URL", r"\s*url"), # Must come before NAME to match 'url' as WXPATH
     # ("NAME", r"[a-zA-Z_][a-zA-Z0-9_]*"),
     ("FOLLOW", r",?\s{,}follow="),
+    ("DEPTH", r",?\s{,}depth="),
     ("OP", r"\|\||<=|>=|!=|=|<|>|\+|-|\*|/|!"), # Added || for string concat
     ("LPAREN", r"\("),
     ("RPAREN", r"\)"),
@@ -63,6 +65,14 @@ def tokenize(src: str):
 class Number:
     value: float
 
+@dataclass
+class Integer:
+    value: int
+
+@dataclass
+class Depth(Integer):
+    pass
+
 @dataclass
 class String:
     value: str
@@ -273,6 +283,10 @@ class Parser:
         if tok.type == "NUMBER":
             self.advance()
             return Number(float(tok.value))
+
+        if tok.type == "INTEGER":
+            self.advance()
+            return Integer(int(tok.value))
 
         if tok.type == "STRING":
             self.advance()
@@ -358,18 +372,18 @@
         self.advance()
 
         return result
-
 
     def capture_url_arg_content(self) -> list[Call | Xpath | ContextItem]:
         """Capture content inside a url() call, handling nested wxpath expressions.
 
         Supports patterns like::
 
-            url('...')
-            url('...' follow=//a/@href)
-            url(
-            url(
-            url( url(
+            url('...') -> [String]
+            url('...' follow=//a/@href) -> [String, Xpath]
+            url('...' follow=//a/@href depth=2) -> [String, Xpath, Integer]
+            url(//a/@href depth=2) -> [Xpath, Integer]
+            url( url('..')//a/@href ) -> [Call, Xpath]
+            url( url( url('..')//a )//b ) -> [Call, Xpath]
 
         Returns:
             A list of parsed elements: Xpath nodes for xpath content and Call
@@ -380,7 +394,10 @@
         paren_balance = 1 # We're already inside the opening paren of url()
         brace_balance = 0 # Track braces for map constructors
         reached_follow_token = False
+        reached_depth_token = False
         follow_xpath = ""
+        depth_number = ""
+
         while paren_balance > 0 and self.token.type != "EOF":
             if self.token.type == "WXPATH":
                 # Found nested wxpath: save any accumulated xpath content first
@@ -396,13 +413,22 @@
 
             elif self.token.type == "FOLLOW":
                 reached_follow_token = True
+                reached_depth_token = False
+                self.advance()
+
+            elif self.token.type == "DEPTH":
+                reached_depth_token = True
+                reached_follow_token = False
                 self.advance()
 
             elif self.token.type == "LPAREN":
                 # Opening paren that's NOT part of a url() call
                 # (it's part of an xpath function like contains(), starts-with(), etc.)
                 paren_balance += 1
-
+                if not reached_follow_token:
+                    current_xpath += self.token.value
+                else:
+                    follow_xpath += self.token.value
                 self.advance()
 
             elif self.token.type == "RPAREN":
@@ -410,26 +436,37 @@
                 if paren_balance == 0:
                     # This is the closing paren of the outer url()
                     break
-
+                if not reached_follow_token:
+                    current_xpath += self.token.value
+                else:
+                    follow_xpath += self.token.value
                 self.advance()
 
             elif self.token.type == "LBRACE":
                 # Opening brace for map constructors
                 brace_balance += 1
-
+                if not reached_follow_token:
+                    current_xpath += self.token.value
+                else:
+                    follow_xpath += self.token.value
                 self.advance()
 
             elif self.token.type == "RBRACE":
                 brace_balance -= 1
-
+                if not reached_follow_token:
+                    current_xpath += self.token.value
+                else:
+                    follow_xpath += self.token.value
                 self.advance()
 
             else:
                 # Accumulate all other tokens as xpath content
-                if
-                current_xpath += self.token.value
-                else:
+                if reached_follow_token:
                     follow_xpath += self.token.value
+                elif reached_depth_token:
+                    depth_number += self.token.value
+                else:
+                    current_xpath += self.token.value
 
                 self.advance()
 
@@ -447,6 +484,9 @@
         if follow_xpath.strip():
             elements.append(Xpath(follow_xpath.strip()))
 
+        if depth_number.strip():
+            elements.append(Depth(int(depth_number.strip())))
+
         return elements
 
     def parse_call(self, func_name: str) -> Call | Segments:
@@ -462,13 +502,16 @@
             self.advance()
             # Handle follow=...
             if self.token.type == "FOLLOW":
-                self.advance()
                 follow_arg = self.capture_url_arg_content()
                 args.extend(follow_arg)
+                if self.token.type == "DEPTH":
+                    depth_arg = self.capture_url_arg_content()
+                    args.extend(depth_arg)
             elif self.token.type == "WXPATH":
                 # Nested wxpath: url( url('...')//a/@href ) or url( /url(...) )
-                #
-                args = self.capture_url_arg_content()
+                # NOTE: We used to use capture_url_arg_content to handle nested wxpath and xpath
+                # args = self.capture_url_arg_content()
+                args = self.nud()
             else:
                 # Simple xpath argument: url(//a/@href)
                 # Could still contain nested wxpath, so use capture_url_arg_content
@@ -489,8 +532,18 @@
 
         return _specify_call_types(func_name, args)
 
-
 def _specify_call_types(func_name: str, args: list) -> Call | Segments:
+    """
+    Specify the type of a call based on the function name and arguments.
+    TODO: Provide example wxpath expressions for each call type.
+
+    Args:
+        func_name: The name of the function.
+        args: The arguments of the function.
+
+    Returns:
+        Call | Segments: The type of the call.
+    """
     if func_name == "url":
         if len(args) == 1:
             if isinstance(args[0], String):
@@ -500,17 +553,33 @@ def _specify_call_types(func_name: str, args: list) -> Call | Segments:
             else:
                 raise ValueError(f"Unknown argument type: {type(args[0])}")
         elif len(args) == 2:
-
+            arg0, arg1 = args
+            if isinstance(arg0, String) and isinstance(arg1, Xpath):
+                # Example: url('...', follow=//a/@href)
                 return UrlCrawl(func_name, args)
-            elif isinstance(
+            elif isinstance(arg0, String) and isinstance(arg1, Integer):
+                # Example: url('...', depth=2)
+                return UrlLiteral(func_name, args)
+            elif isinstance(arg0, UrlLiteral) and isinstance(arg1, Xpath):
                 args.append(UrlQuery('url', [ContextItem()]))
                 return Segments(args)
-            elif isinstance(
-                segs =
-                segs.append(
+            elif isinstance(arg0, (Segments, list)) and isinstance(arg1, Xpath):
+                segs = arg0
+                segs.append(arg1)
                 return Segments(segs)
             else:
                 raise ValueError(f"Unknown arguments: {args}")
+        elif len(args) == 3:
+            arg0, arg1, arg2 = args
+            if (isinstance(arg0, String) and (
+                (isinstance(arg1, Xpath) and isinstance(arg2, Integer)) or
+                (isinstance(arg1, Integer) and isinstance(arg2, Xpath))
+            )):
+                # Example: url('...', follow=//a/@href, depth=2)
+                # Example: url('...', depth=2, follow=//a/@href)
+                return UrlCrawl(func_name, args)
+            else:
+                raise ValueError(f"Unknown arguments: {args}")
         else:
             raise ValueError(f"Unknown arguments: {args}")
     elif func_name == "/url" or func_name == "//url":
```
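Taken together, the tokenizer, AST, and call-typing changes make `depth=` part of the expression language itself. A usage sketch, assuming the runtime honors the parsed depth (only the parsing changes are shown in this diff):

```python
import wxpath

# depth= is tokenized (DEPTH), carried as a Depth node, and typed by
# _specify_call_types (see the diff above). How the runtime bounds the crawl
# with it is an assumption here; only the parsing changes appear in this diff.
expr = "url('https://quotes.toscrape.com', follow=//a/@href, depth=2)//title/text()"

for title in wxpath.wxpath_async_blocking_iter(expr):
    print(title)
```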
|