wxpath 0.1.0__tar.gz

wxpath-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Rod Palacios
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the “Software”), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
wxpath-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,404 @@
1
+ Metadata-Version: 2.4
2
+ Name: wxpath
3
+ Version: 0.1.0
4
+ Summary: wxpath - a declarative web crawler and data extractor
5
+ Author-email: Rodrigo Palacios <rodrigopala91@gmail.com>
6
+ License-Expression: MIT
7
+ Requires-Python: >=3.9
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: requests>=2.0
11
+ Requires-Dist: lxml>=4.0
12
+ Requires-Dist: elementpath>=5.0.0
13
+ Requires-Dist: aiohttp>=3.8.0
14
+ Provides-Extra: test
15
+ Requires-Dist: pytest>=7.0; extra == "test"
16
+ Requires-Dist: pytest-asyncio>=0.23; extra == "test"
17
+ Dynamic: license-file
18
+
19
+
20
+ # wxpath - declarative web crawling with XPath
21
+
22
+ **wxpath** is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression. **wxpath** evaluates that expression concurrently, breadth-first-*ish*, and streams results as they are discovered.
23
+
24
+ By introducing the `url(...)` operator and the `///` syntax, **wxpath**'s engine is able to perform deep, recursive web crawling and extraction.
25
+
26
+ NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented).
27
+
28
+ ## Contents
29
+
30
+ - [Example](#example)
31
+ - [`url(...)` and `///` Explained](#url-and---explained)
32
+ - [General flow](#general-flow)
33
+ - [Asynchronous Crawling](#asynchronous-crawling)
34
+ - [Output types](#output-types)
35
+ - [XPath 3.1 By Default](#xpath-31-by-default)
36
+ - [CLI](#cli)
37
+ - [Hooks (Experimental)](#hooks-experimental)
38
+ - [Install](#install)
39
+ - [More Examples](#more-examples)
40
+ - [Advanced: Engine & Crawler Configuration](#advanced-engine--crawler-configuration)
41
+ - [Project Philosophy](#project-philosophy)
42
+ - [Warnings](#warnings)
43
+ - [License](#license)
44
+
45
+ ## Example
46
+
47
+ ```python
48
+ import wxpath
49
+
50
+ path = """
51
+ url('https://en.wikipedia.org/wiki/Expression_language')
52
+ ///main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))]/url(.)
53
+ /map{
54
+ 'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
55
+ 'url':string(base-uri(.)),
56
+ 'short_description':(//div[contains(@class, 'shortdescription')]/text())[1]
57
+ }
58
+ """
59
+
60
+ for item in wxpath.wxpath_async_blocking_iter(path, max_depth=1):
61
+ print(item)
62
+ ```
63
+
64
+ Output:
65
+
66
+ ```python
67
+ map{'title': TextNode('Computer language'), 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': TextNode('Formal language for communicating with a computer')}
68
+ map{'title': TextNode('Machine-readable medium and data'), 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': TextNode('Medium capable of storing data in a format readable by a machine')}
69
+ map{'title': TextNode('Advanced Boolean Expression Language'), 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': TextNode('Hardware description language and software')}
70
+ map{'title': TextNode('Jakarta Expression Language'), 'url': 'https://en.wikipedia.org/wiki/Jakarta_Expression_Language', 'short_description': TextNode('Computer programming language')}
71
+ map{'title': TextNode('Data Analysis Expressions'), 'url': 'https://en.wikipedia.org/wiki/Data_Analysis_Expressions', 'short_description': TextNode('Formula and data query language')}
72
+ map{'title': TextNode('Domain knowledge'), 'url': 'https://en.wikipedia.org/wiki/Domain_knowledge', 'short_description': TextNode('Specialist knowledge within a specific field')}
73
+ map{'title': TextNode('Rights Expression Language'), 'url': 'https://en.wikipedia.org/wiki/Rights_Expression_Language', 'short_description': TextNode('Machine-processable language used to express intellectual property rights (such as copyright)')}
74
+ map{'title': TextNode('Computer science'), 'url': 'https://en.wikipedia.org/wiki/Computer_science', 'short_description': TextNode('Study of computation')}
75
+ ```
76
+
77
+ The above expression does the following:
78
+
79
+ 1. Starts at the specified URL, `https://en.wikipedia.org/wiki/Expression_language`.
80
+ 2. Filters for links in the `<main>` section that start with `/wiki/` and do not contain a colon (`:`).
81
+ 3. For each link found,
82
+ * it follows the link and extracts the title, URL, and short description of the page.
83
+ * it repeats step 2 until the maximum depth is reached.
84
+ 4. Streams the extracted data as it is discovered.
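
Step 2's filter can be mirrored in plain Python to make its effect concrete. This is only an illustrative sketch using the standard library: the `keep` helper and the sample `hrefs` list are hypothetical and are not part of **wxpath**.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/wiki/Expression_language"


def keep(href: str) -> bool:
    """Plain-Python mirror of: starts-with(., '/wiki/') and not(contains(., ':'))."""
    return href.startswith("/wiki/") and ":" not in href


# Hypothetical hrefs, as they might appear inside <main>.
hrefs = [
    "/wiki/Computer_language",
    "/wiki/File:Example.png",     # dropped: contains ':'
    "https://example.org/other",  # dropped: does not start with '/wiki/'
    "/wiki/Domain_knowledge",
]

# Relative links are resolved against the page they were found on.
followed = [urljoin(BASE, h) for h in hrefs if keep(h)]
print(followed)
```

Only the two article links survive the filter, so only those pages would be fetched at the next depth.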
85
+
86
+
87
+ ## `url(...)` and `///` Explained
88
+
89
+ - `url(...)` is a custom operator that fetches the content of the user-specified or internally generated URL and returns it as an `lxml.html.HtmlElement` for further XPath processing.
90
+ - `///` indicates recursive traversal. It tells **wxpath** to keep following links until the frontier is exhausted or the specified `max_depth` is reached. Unlike repeated `url()` hops, it allows a single expression to describe open-ended graph exploration. WARNING: Use with caution and constraints (via `max_depth` or XPath predicates) to avoid traversal explosion.
91
+
92
+ ## General flow
93
+
94
+ **wxpath** evaluates an expression as a list of traversal and extraction steps (internally referred to as `Segment`s).
95
+
96
+ `url(...)` creates crawl tasks either statically (via a fixed URL) or dynamically (via a URL derived from the XPath expression). **URLs are deduplicated globally (not per depth), on a best-effort basis**.
97
+
98
+ XPath segments operate on fetched documents (fetched via the immediately preceding `url(...)` operations).
99
+
100
+ `///` indicates infinite/recursive traversal - it proceeds breadth-first-*ish* up to `max_depth`.
101
+
102
+ Results are yielded as soon as they are ready.
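
The flow above can be approximated with a small, sequential breadth-first traversal. This is a sketch under stated assumptions, not the engine's actual implementation: the `LINKS` graph stands in for fetched pages, and the `crawl` function is hypothetical. It shows global deduplication, an inclusive `max_depth`, and results yielded as they are discovered.

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages.
LINKS = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A", "E"],
    "D": [],
    "E": ["F"],
    "F": [],
}


def crawl(seed, max_depth):
    """Yield (url, depth) breadth-first, visiting each URL at most once."""
    seen = {seed}                  # global dedup set, shared across depths
    frontier = deque([(seed, 0)])
    while frontier:
        url, depth = frontier.popleft()
        yield url, depth           # stream the result immediately
        if depth == max_depth:     # inclusive limit: do not expand further
            continue
        for nxt in LINKS.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))


result = list(crawl("A", max_depth=2))
print(result)
```

With `max_depth=2`, page `F` is never reached, and `C` is visited once even though both `A` and `B` link to it. A concurrent engine would not preserve this strict ordering, which is why the traversal is described as breadth-first-*ish*.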
103
+
104
+
105
+ ## Asynchronous Crawling
106
+
107
+
108
+ **wxpath** is `asyncio/aiohttp`-first, providing an asynchronous API for crawling and extracting data.
109
+
110
+ ```python
111
+ import asyncio
112
+ from wxpath import wxpath_async
113
+
114
+ items = []
115
+
116
+ async def main():
117
+ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(@href[starts-with(., '/wiki/')])//a/@href"
118
+ async for item in wxpath_async(path_expr, max_depth=1):
119
+ items.append(item)
120
+
121
+ asyncio.run(main())
122
+ ```
123
+
124
+ ### Blocking, Concurrent Requests
125
+
126
+
127
+ **wxpath** also supports concurrent requests using an asyncio-in-sync pattern, allowing you to crawl multiple pages concurrently while maintaining the simplicity of synchronous code. This is particularly useful for crawls in strictly synchronous execution environments (i.e., not inside an `asyncio` event loop) where performance is a concern.
128
+
129
+ ```python
130
+ from wxpath import wxpath_async_blocking_iter
131
+
132
+ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(@href[starts-with(., '/wiki/')])//a/@href"
133
+ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1))
134
+ ```
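
One common way to build an asyncio-in-sync iterator is to run the async generator on a worker thread and hand items over through a queue. The sketch below illustrates that general pattern with the standard library only. It is an assumption about how such a bridge can be built, not a description of how `wxpath_async_blocking_iter` is implemented; `fake_crawl` and `blocking_iter` are hypothetical names.

```python
import asyncio
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


async def fake_crawl(n):
    """Stand-in for an async crawl: yields n items with small delays."""
    for i in range(n):
        await asyncio.sleep(0.01)
        yield f"item-{i}"


def blocking_iter(agen_factory):
    """Drive an async generator on a worker thread; yield its items synchronously."""
    q = queue.Queue()

    async def pump():
        try:
            async for item in agen_factory():
                q.put(item)
        except Exception as exc:  # forward errors to the consumer
            q.put(exc)
        finally:
            q.put(_DONE)

    # asyncio.run creates a fresh event loop inside the worker thread.
    worker = threading.Thread(target=lambda: asyncio.run(pump()), daemon=True)
    worker.start()
    while True:
        item = q.get()
        if item is _DONE:
            break
        if isinstance(item, Exception):
            raise item
        yield item
    worker.join()


items = list(blocking_iter(lambda: fake_crawl(3)))
print(items)
```

Because the event loop lives on its own thread, the caller never needs to be inside a running loop, and items arrive as soon as the producer emits them.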
135
+
136
+ ## Output types
137
+
138
+ The wxpath Python API yields structured objects, not just strings.
139
+
140
+ Depending on the expression, results may include:
141
+
142
+ - `lxml.*` and `lxml.html.*` objects
143
+ - `elementpath.datatypes.*` objects (for XPath 3.1 features)
144
+ - `WxStr` (string values with provenance)
145
+ - dictionaries / maps
146
+ - lists or other XPath-native values
147
+
148
+ The CLI flattens these objects into plain JSON for display.
149
+ The Python API preserves structure by default.
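
To illustrate the difference between structured results and flattened JSON, here is a minimal sketch using only the standard library. `ProvStr` is a hypothetical stand-in for a string type that carries provenance, and `flatten` is an assumed helper; neither reflects the actual `WxStr` class or the CLI's real conversion code.

```python
import json


class ProvStr(str):
    """Hypothetical string subclass that remembers the URL it came from."""

    def __new__(cls, value, source_url=None):
        obj = super().__new__(cls, value)
        obj.source_url = source_url
        return obj


def flatten(value):
    """Recursively convert nested results into plain JSON-compatible values."""
    if isinstance(value, dict):
        return {str(k): flatten(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [flatten(v) for v in value]
    if isinstance(value, str):
        return str(value)  # drops the subclass and its provenance
    if isinstance(value, (int, float, bool)) or value is None:
        return value
    return str(value)      # fallback for any other object


item = {
    "title": ProvStr("Computer language",
                     source_url="https://en.wikipedia.org/wiki/Computer_language"),
    "links": (ProvStr("/wiki/A"), ProvStr("/wiki/B")),
    "depth": 1.0,
}
line = json.dumps(flatten(item))
print(line)
```

The structured `item` still exposes `item["title"].source_url`, while the flattened line keeps only the plain values.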
150
+
151
+
152
+ ## XPath 3.1 By Default
153
+
154
+ **wxpath** uses the `elementpath` library to provide XPath 3.1 support, enabling advanced XPath features like **maps**, **arrays**, and more. This allows you to write more powerful XPath queries.
155
+
156
+ ```python
157
+ path_expr = """
158
+ url('https://en.wikipedia.org/wiki/Expression_language')
159
+ ///div[@id='mw-content-text']//a/url(@href)
160
+ /map{
161
+ 'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
162
+ 'short_description':(//div[contains(@class, "shortdescription")]/text())[1],
163
+ 'url'://link[@rel='canonical']/@href[1]
164
+ }
165
+ """
166
+ # [...
167
+ # {'title': 'Computer language',
168
+ # 'short_description': 'Formal language for communicating with a computer',
169
+ # 'url': 'https://en.wikipedia.org/wiki/Computer_language'},
170
+ # {'title': 'Machine-readable medium and data',
171
+ # 'short_description': 'Medium capable of storing data in a format readable by a machine',
172
+ # 'url': 'https://en.wikipedia.org/wiki/Machine-readable_medium_and_data'},
173
+ # {'title': 'Domain knowledge',
174
+ # 'short_description': 'Specialist knowledge within a specific field',
175
+ # 'url': 'https://en.wikipedia.org/wiki/Domain_knowledge'},
176
+ # ...]
177
+ ```
178
+
179
+ ## CLI
180
+
181
+ **wxpath** provides a command-line interface (CLI) to quickly experiment with and execute wxpath expressions directly from the terminal.
182
+
183
+ ```bash
184
+ > wxpath --depth 1 "\
185
+ url('https://en.wikipedia.org/wiki/Expression_language')\
186
+ ///div[@id='mw-content-text'] \
187
+ //a/url(@href[starts-with(., '/wiki/') \
188
+ and not(matches(., '^(?:/wiki/)?(?:Wikipedia|File|Template|Special|Template_talk|Help):'))]) \
189
+ /map{ \
190
+ 'title':(//span[contains(@class, 'mw-page-title-main')]/text())[1], \
191
+ 'short_description':(//div[contains(@class, 'shortdescription')]/text())[1], \
192
+ 'url':string(base-uri(.)), \
193
+ 'backlink':wx:backlink(.), \
194
+ 'depth':wx:depth(.) \
195
+ }"
196
+
197
+ {"title": "Computer language", "short_description": "Formal language for communicating with a computer", "url": "https://en.wikipedia.org/wiki/Computer_language", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
198
+ {"title": "Machine-readable medium and data", "short_description": "Medium capable of storing data in a format readable by a machine", "url": "https://en.wikipedia.org/wiki/Machine_readable", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
199
+ {"title": "Domain knowledge", "short_description": "Specialist knowledge within a specific field", "url": "https://en.wikipedia.org/wiki/Domain_knowledge", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
200
+ {"title": "Advanced Boolean Expression Language", "short_description": "Hardware description language and software", "url": "https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
201
+ {"title": "Data Analysis Expressions", "short_description": "Formula and data query language", "url": "https://en.wikipedia.org/wiki/Data_Analysis_Expressions", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
202
+ {"title": "Jakarta Expression Language", "short_description": "Computer programming language", "url": "https://en.wikipedia.org/wiki/Jakarta_Expression_Language", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
203
+ {"title": "Rights Expression Language", "short_description": [], "url": "https://en.wikipedia.org/wiki/Rights_Expression_Language", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
204
+ {"title": "Computer science", "short_description": "Study of computation", "url": "https://en.wikipedia.org/wiki/Computer_science", "backlink": "https://en.wikipedia.org/wiki/Expression_language", "depth": 1.0}
205
+ ```
206
+
207
+
208
+ ## Hooks (Experimental)
209
+
210
+ **wxpath** supports a pluggable hook system that allows you to modify the crawling and extraction behavior. You can register hooks to preprocess URLs, post-process HTML, filter extracted values, and more. Hooks will be executed in the order they are registered. Hooks may impact performance.
211
+
212
+ ```python
213
+
214
+ from wxpath import hooks
215
+
216
+ @hooks.register
217
+ class OnlyEnglish:
218
+ def post_parse(self, ctx, elem):
219
+ lang = elem.xpath('string(/html/@lang)').lower()[:2]
220
+ return elem if lang in ("en", "") else None
221
+ ```
222
+
223
+ ### Async usage
224
+
225
+ NOTE: Hooks may be synchronous or asynchronous, but all hooks in a project should follow the same style.
226
+ Mixing sync and async hooks is not supported and may lead to unexpected behavior.
227
+
228
+ ```python
229
+
230
+ from wxpath import hooks
231
+
232
+ @hooks.register
233
+ class OnlyEnglish:
234
+ async def post_parse(self, ctx, elem):
235
+ lang = elem.xpath('string(/html/@lang)').lower()[:2]
236
+ return elem if lang in ("en", "") else None
237
+
238
+ ```
239
+
240
+ ### Predefined Hooks
241
+
242
+ `JSONLWriter` (aliased `NDJSONWriter`) is a built-in hook that writes extracted data to a newline-delimited JSON file. This is useful for storing results in a structured format that can be easily processed later.
243
+
244
+ ```python
245
+ from wxpath import hooks
246
+ hooks.register(hooks.JSONLWriter)
247
+ ```
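
For reference, newline-delimited JSON is simply one JSON document per line. The following standard-library sketch shows the format itself; `write_jsonl` is a hypothetical helper and does not describe how `JSONLWriter` is implemented or configured.

```python
import io
import json


def write_jsonl(items, fp):
    """Write one compact JSON object per line; return the number of lines written."""
    count = 0
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")
        count += 1
    return count


buf = io.StringIO()  # an in-memory file; a real run would open a file on disk
n = write_jsonl(
    [
        {"title": "Computer language", "depth": 1},
        {"title": "Domain knowledge", "depth": 1},
    ],
    buf,
)

# Reading it back: parse each line independently.
rows = [json.loads(line) for line in buf.getvalue().splitlines()]
print(n, rows)
```

Because each line is parsed independently, a partially written file from an interrupted crawl remains readable up to its last complete line.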
248
+
249
+
250
+ ## Install
251
+
252
+ ```
253
+ pip install wxpath
254
+ ```
255
+
256
+
257
+ ## More Examples
258
+
259
+ ```python
260
+ import wxpath
261
+
262
+ #### EXAMPLE 1 - Simple, single page crawl and link extraction #######
263
+ #
264
+ # Starting from Expression language's wiki, extract all links (hrefs)
265
+ # from the main section. The `url(...)` operator is used to execute a
266
+ # web request to the specified URL and return the HTML content.
267
+ #
268
+ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//main//a/@href"
269
+
270
+ items = wxpath.wxpath_async_blocking(path_expr)
271
+
272
+
273
+ #### EXAMPLE 2 - Two-deep crawl and link extraction ##################
274
+ #
275
+ # Starting from Expression language's wiki, crawl all child links
276
+ # starting with '/wiki/', and extract each child's links (hrefs). The
277
+ # `url(...)` operator receives its arguments from the evaluated XPath.
278
+ #
279
+ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')//url(@href[starts-with(., '/wiki/')])//a/@href"
280
+
281
+ #### EXAMPLE 3 - Infinite crawl with BFS tree depth limit ############
282
+ #
283
+ # Starting from Expression language's wiki, infinitely crawl all child
284
+ # links (and child's child's links recursively). The `///` syntax is
285
+ # used to indicate an infinite crawl.
286
+ # Returns lxml.html.HtmlElement objects.
287
+ #
288
+ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///main//a/url(@href)"
289
+
290
+ # The same expression written differently:
291
+ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///url(//main//a/@href)"
292
+
293
+ # Modify (inclusive) max_depth to limit the BFS tree (crawl depth).
294
+ items = wxpath.wxpath_async_blocking(path_expr, max_depth=1)
295
+
296
+ #### EXAMPLE 4 - Infinite crawl with field extraction ################
297
+ #
298
+ # Infinitely crawls Expression language's wiki's child links and
299
+ # children's child links (recursively) and then, for each child link
300
+ # crawled, extracts objects with the named fields as a dict.
301
+ #
302
+ path_expr = """
303
+ url('https://en.wikipedia.org/wiki/Expression_language')
304
+ ///main//a/url(@href)
305
+ /map {
306
+ 'title':(//span[contains(@class, "mw-page-title-main")]/text())[1],
307
+ 'short_description':(//div[contains(@class, "shortdescription")]/text())[1],
308
+ 'url'://link[@rel='canonical']/@href[1],
309
+ 'backlink':wx:backlink(.),
310
+ 'depth':wx:depth(.)
311
+ }
312
+ """
313
+
314
+ # Under the hood of wxpath.core.wxpath, a `segments` list is generated,
315
+ # revealing the operations executed to accomplish the crawl.
316
+ # >> segments = wxpath.core.parser.parse_wxpath_expr(path_expr)
317
+ # >> segments
318
+ # [Segment(op='url', value='https://en.wikipedia.org/wiki/Expression_language'),
319
+ # Segment(op='url_inf', value='///url(//main//a/@href)'),
320
+ # Segment(op='xpath', value='/map { \'title\':(//span[contains(@class, "mw-page-title-main")]/text())[1], \'short_description\':(//div[contains(@class, "shortdescription")]/text())[1], \'url\'://link[@rel=\'canonical\']/@href[1] }')]
321
+
322
+ #### EXAMPLE 5 - Seeding from XPath function expression + mapping operator (`!`)
323
+ #
324
+ # Functionally create 10 Amazon book search result page URLs, map each URL to
325
+ # the url(.) operator, and for each page, extract the title, price, and link of
326
+ # each book listed.
327
+ #
328
+ base_url = "https://www.amazon.com/s?k=books&i=stripbooks&page="
329
+
330
+ path_expr = f"""
331
+ (1 to 10) ! ('{base_url}' || .) !
332
+ url(.)
333
+ //span[@data-component-type='s-search-results']//*[@role='listitem']
334
+ /map {{
335
+ 'title': (.//h2/span/text())[1],
336
+ 'price': (.//span[@class='a-price']/span[@class='a-offscreen']/text())[1],
337
+ 'link': (.//a[@aria-describedby='price-link']/@href)[1]
338
+ }}
339
+ """
340
+
341
+ items = list(wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1))
342
+ ```
343
+
344
+ ## Advanced: Engine & Crawler Configuration
345
+
346
+ You can alter the engine and crawler's behavior like so:
347
+
348
+ ```python
349
+ from wxpath import wxpath_async_blocking_iter
350
+ from wxpath.core.runtime import WXPathEngine
351
+ from wxpath.http.client.crawler import Crawler
352
+
353
+ crawler = Crawler(
354
+ concurrency=8,
355
+ per_host=2,
356
+ timeout=10,
357
+ )
358
+
359
+ # If `crawler` is not specified, a default Crawler will be created with
360
+ # the provided concurrency and per_host values, or with defaults.
361
+ engine = WXPathEngine(
362
+ # concurrency=16,
363
+ # per_host=8,
364
+ crawler=crawler,
365
+ )
366
+
367
+ path_expr = "url('https://en.wikipedia.org/wiki/Expression_language')///main//a/url(@href)"
368
+
369
+ items = list(wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine))
370
+ ```
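
To give an intuition for what a global `concurrency` cap combined with a `per_host` cap means, here is a sketch built on `asyncio.Semaphore`. It is an illustration of the general technique under stated assumptions, not the `Crawler` implementation: the `Limiter` class, the `*.example` hosts, and the `asyncio.sleep` stand-in for an HTTP request are all hypothetical.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class Limiter:
    """Cap in-flight requests globally and per host (illustrative only)."""

    def __init__(self, concurrency, per_host):
        self._global = asyncio.Semaphore(concurrency)
        self._hosts = defaultdict(lambda: asyncio.Semaphore(per_host))
        self.active = defaultdict(int)  # requests currently in flight, per host
        self.peak = defaultdict(int)    # highest in-flight count seen, per host

    async def fetch(self, url):
        host = urlsplit(url).netloc
        # Take the host slot first so a busy host cannot hog global slots.
        async with self._hosts[host], self._global:
            self.active[host] += 1
            self.peak[host] = max(self.peak[host], self.active[host])
            await asyncio.sleep(0.01)   # stand-in for the HTTP request
            self.active[host] -= 1
        return url


async def main():
    limiter = Limiter(concurrency=8, per_host=2)
    urls = [f"https://a.example/{i}" for i in range(6)]
    urls += [f"https://b.example/{i}" for i in range(3)]
    done = await asyncio.gather(*(limiter.fetch(u) for u in urls))
    return done, dict(limiter.peak)


done, peak = asyncio.run(main())
print(len(done), peak)
```

All nine requests complete, but no host ever has more than two requests in flight at once.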
371
+
372
+
373
+ ## Project Philosophy
374
+
375
+ ### Principles
376
+
377
+ - Enable declarative, recursive scraping without boilerplate
378
+ - Stay lightweight and composable
379
+ - Support asynchronous, high-performance crawls
380
+
381
+ ### Guarantees/Goals
382
+
383
+ - URLs are deduplicated on a best-effort, per-crawl basis.
384
+ - Crawls are intended to terminate once the frontier is exhausted or `max_depth` is reached.
385
+ - Requests are performed concurrently.
386
+ - Results are streamed as soon as they are available.
387
+
388
+ ### Non-Goals/Limitations (for now)
389
+
390
+ - Strict result ordering
391
+ - Persistent scheduling or crawl resumption
392
+ - Automatic proxy rotation
393
+ - Browser-based rendering (JavaScript execution)
394
+
395
+ ## Warnings
396
+
397
+ - Be respectful when crawling websites. A scrapy-inspired throttler is enabled by default.
398
+ - Recursive (`///`) crawls require user discipline to avoid unbounded expansion (traversal explosion).
399
+ - Deadlocks and hangs are possible in certain situations (e.g., all tasks waiting on blocked requests). Please report issues if you encounter such behavior.
400
+ - Consider using timeouts, `max_depth`, and XPath predicates and filters to limit crawl scope.
401
+
402
+ ## License
403
+
404
+ MIT