ultimate-pi 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (51)
  1. package/.agents/skills/caveman/SKILL.md +67 -0
  2. package/.agents/skills/compress/SKILL.md +111 -0
  3. package/.agents/skills/compress/scripts/__init__.py +9 -0
  4. package/.agents/skills/compress/scripts/__main__.py +3 -0
  5. package/.agents/skills/compress/scripts/benchmark.py +78 -0
  6. package/.agents/skills/compress/scripts/cli.py +73 -0
  7. package/.agents/skills/compress/scripts/compress.py +227 -0
  8. package/.agents/skills/compress/scripts/detect.py +121 -0
  9. package/.agents/skills/compress/scripts/validate.py +189 -0
  10. package/.agents/skills/context7-cli/SKILL.md +73 -0
  11. package/.agents/skills/context7-cli/references/docs.md +121 -0
  12. package/.agents/skills/context7-cli/references/setup.md +43 -0
  13. package/.agents/skills/context7-cli/references/skills.md +118 -0
  14. package/.agents/skills/emil-design-eng/SKILL.md +679 -0
  15. package/.agents/skills/lean-ctx/SKILL.md +149 -0
  16. package/.agents/skills/lean-ctx/scripts/install.sh +95 -0
  17. package/.agents/skills/scrapling-official/LICENSE.txt +28 -0
  18. package/.agents/skills/scrapling-official/SKILL.md +390 -0
  19. package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +26 -0
  20. package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +26 -0
  21. package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +26 -0
  22. package/.agents/skills/scrapling-official/examples/04_spider.py +58 -0
  23. package/.agents/skills/scrapling-official/examples/README.md +45 -0
  24. package/.agents/skills/scrapling-official/references/fetching/choosing.md +78 -0
  25. package/.agents/skills/scrapling-official/references/fetching/dynamic.md +352 -0
  26. package/.agents/skills/scrapling-official/references/fetching/static.md +432 -0
  27. package/.agents/skills/scrapling-official/references/fetching/stealthy.md +255 -0
  28. package/.agents/skills/scrapling-official/references/mcp-server.md +214 -0
  29. package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +86 -0
  30. package/.agents/skills/scrapling-official/references/parsing/adaptive.md +212 -0
  31. package/.agents/skills/scrapling-official/references/parsing/main_classes.md +586 -0
  32. package/.agents/skills/scrapling-official/references/parsing/selection.md +494 -0
  33. package/.agents/skills/scrapling-official/references/spiders/advanced.md +344 -0
  34. package/.agents/skills/scrapling-official/references/spiders/architecture.md +94 -0
  35. package/.agents/skills/scrapling-official/references/spiders/getting-started.md +164 -0
  36. package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +235 -0
  37. package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +196 -0
  38. package/.agents/skills/scrapling-official/references/spiders/sessions.md +205 -0
  39. package/.github/banner.png +0 -0
  40. package/.pi/SYSTEM.md +40 -0
  41. package/.pi/settings.json +5 -0
  42. package/PLAN.md +11 -0
  43. package/README.md +58 -0
  44. package/extensions/lean-ctx-enforce.ts +166 -0
  45. package/package.json +17 -0
  46. package/skills-lock.json +35 -0
  47. package/wiki/README.md +10 -0
  48. package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +25 -0
  49. package/wiki/decisions/0002-add-project-banner-to-readme.md +26 -0
  50. package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +26 -0
  51. package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +26 -0
@@ -0,0 +1,586 @@
# Parsing main classes

The [Selector](#selector) class is the core parsing engine in Scrapling, providing HTML parsing and element selection capabilities. You can import it with either of the following:
```python
from scrapling import Selector
from scrapling.parser import Selector
```
Usage:
```python
page = Selector(
    '<html>...</html>',
    url='https://example.com'
)

# Then select elements as you like
elements = page.css('.product')
```
In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, a [Selector](#selector) object. Any operation you do, like selection, navigation, etc., will return either a [Selector](#selector) object or a [Selectors](#selectors) object, provided the result is an element or elements from the page rather than text or similar.

The main page is a [Selector](#selector) object, and the elements within are [Selector](#selector) objects. Any text (text content inside elements or attribute values) is a [TextHandler](#texthandler) object, and element attributes are stored as [AttributesHandler](#attributeshandler).

## Selector
### Arguments explained
The most important argument is `content`, which holds the HTML you want to parse; it accepts the HTML content as `str` or `bytes`.

The arguments `url`, `adaptive`, `storage`, and `storage_args` are settings used with the `adaptive` feature. They are explained on the [adaptive](adaptive.md) feature page.

Arguments for parsing adjustments:

- **encoding**: The encoding used while parsing the HTML. The default is `UTF-8`.
- **keep_comments**: Tells the library whether to keep HTML comments while parsing the page. It's disabled by default because comments can interfere with your scraping in various ways.
- **keep_cdata**: Same logic as HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML.

The arguments `huge_tree` and `root` are advanced features not covered here.

Most properties on the main page and its elements are lazily loaded (not initialized until accessed), which contributes to Scrapling's speed.

### Properties
Properties for traversal are covered in the [traversal](#traversal) section below.

We'll parse this HTML page as an example:
```html
<html>
  <head>
    <title>Some page</title>
  </head>
  <body>
    <div class="product-list">
      <article class="product" data-id="1">
        <h3>Product 1</h3>
        <p class="description">This is product 1</p>
        <span class="price">$10.99</span>
        <div class="hidden stock">In stock: 5</div>
      </article>

      <article class="product" data-id="2">
        <h3>Product 2</h3>
        <p class="description">This is product 2</p>
        <span class="price">$20.99</span>
        <div class="hidden stock">In stock: 3</div>
      </article>

      <article class="product" data-id="3">
        <h3>Product 3</h3>
        <p class="description">This is product 3</p>
        <span class="price">$15.99</span>
        <div class="hidden stock">Out of stock</div>
      </article>
    </div>

    <script id="page-data" type="application/json">
      {
        "lastUpdated": "2024-09-22T10:30:00Z",
        "totalProducts": 3
      }
    </script>
  </body>
</html>
```
Load the page directly as shown before:
```python
from scrapling import Selector
page = Selector(html_doc)
```
Get all text content on the page recursively:
```python
>>> page.get_all_text()
'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
```
Get the first article (used as an example throughout):
```python
article = page.find('article')
```
With the same logic, get all text content on the element recursively:
```python
>>> article.get_all_text()
'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
```
But if you try to get the direct text content, it will be empty, because the element has no direct text in the HTML above:
```python
>>> article.text
''
```
The `get_all_text` method has the following optional arguments:

1. **separator**: All strings collected are concatenated using this separator. The default is `'\n'`.
2. **strip**: If enabled, strings are stripped before concatenation. Disabled by default.
3. **ignore_tags**: A tuple of tag names to ignore in the final result, together with any elements nested within them. The default is `('script', 'style',)`.
4. **valid_values**: If enabled, the method only collects elements with real values, so elements whose text content is empty or only whitespace are ignored. It's enabled by default.

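The effect of `separator` and `strip` can be pictured with plain string operations (a sketch of the concatenation behavior described above, not Scrapling's implementation):

```python
# Text chunks as they might be collected from the first article's elements
chunks = ["Product 1", "  This is product 1  ", "$10.99", "In stock: 5"]

# strip=True strips each string before concatenation;
# separator is what joins them (the default is '\n')
result = " | ".join(chunk.strip() for chunk in chunks)
print(result)  # Product 1 | This is product 1 | $10.99 | In stock: 5
```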
The text returned is a [TextHandler](#texthandler), not a standard string. If the text content can be serialized to JSON, use `.json()` on it:
```python
>>> script = page.find('script')
>>> script.json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
Let's continue and get the element's tag:
```python
>>> article.tag
'article'
```
Using it on the page directly operates on the root `html` element:
```python
>>> page.tag
'html'
```
Get the attributes of the element:
```python
>>> print(article.attrib)
{'class': 'product', 'data-id': '1'}
```
Access a specific attribute with any of the following:
```python
>>> article.attrib['class']
>>> article.attrib.get('class')
>>> article['class']  # new in v0.3
```
Check whether the attributes contain a specific attribute with either of the methods below:
```python
>>> 'class' in article.attrib
>>> 'class' in article  # new in v0.3
```
Get the HTML content of the element:
```python
>>> article.html_content
'<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
```
Get the prettified version of the element's HTML content:
```python
print(article.prettify())
```
```html
<article class="product" data-id="1"><h3>Product 1</h3>
  <p class="description">This is product 1</p>
  <span class="price">$10.99</span>
</article>
```
Use the `.body` property to get the raw content of the page. Starting from v0.4, when used on a `Response` object from fetchers, `.body` always returns `bytes`.
```python
>>> page.body
'<html>\n <head>\n <title>Some page</title>\n </head>\n ...'
```
To get all the ancestors of this element in the DOM tree:
```python
>>> article.path
[<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>,
 <data='<body> <div class="product-list"> <artic...' parent='<html><head><title>Some page</title></he...'>,
 <data='<html><head><title>Some page</title></he...'>]
```
Generate a shortened CSS selector when possible, or the full selector:
```python
>>> article.generate_css_selector
'body > div > article'
>>> article.generate_full_css_selector
'body > div > article'
```
Same with XPath:
```python
>>> article.generate_xpath_selector
"//body/div/article"
>>> article.generate_full_xpath_selector
"//body/div/article"
```

### Traversal
Properties and methods for navigating elements on the page.

The `html` element is the root of the website's tree. Elements like `head` and `body` are "children" of `html`, and `html` is their "parent". The element `body` is a "sibling" of `head` and vice versa.

Access the parent of an element:
```python
>>> article.parent
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
>>> article.parent.tag
'div'
```
Chaining is supported, as with all similar properties/methods:
```python
>>> article.parent.parent.tag
'body'
```
Get the children of an element:
```python
>>> article.children
[<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
```
Get all elements underneath an element. It acts as a recursive version of the `children` property:
```python
>>> article.below_elements
[<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
```
For this element, the result matches the `children` property because its children have no children of their own.

Another example, using the element with the `product-list` class, makes the difference between the `children` property and the `below_elements` property clear:
```python
>>> products_list = page.css('.product-list')[0]
>>> products_list.children
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]

>>> products_list.below_elements
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 ...]
```
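The distinction can be sketched in plain Python with a tiny tree type (illustrative only, not Scrapling's implementation): `children` is one level deep, while `below_elements` walks the entire subtree.

```python
class Node:
    """A minimal tree node standing in for an HTML element."""

    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)  # direct children only

    @property
    def below_elements(self):
        # recursive: each child, then everything beneath that child
        out = []
        for child in self.children:
            out.append(child)
            out.extend(child.below_elements)
        return out


div = Node("div", [Node("article", [Node("h3"), Node("p")]), Node("article")])
print([n.tag for n in div.children])        # ['article', 'article']
print([n.tag for n in div.below_elements])  # ['article', 'h3', 'p', 'article']
```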
Get the siblings of an element:
```python
>>> article.siblings
[<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
Get the next element of the current element:
```python
>>> article.next
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
```
The same logic applies to the `previous` property:
```python
>>> article.previous  # It's the first child, so it doesn't have a previous element
>>> second_article = page.css('.product[data-id="2"]')[0]
>>> second_article.previous
<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```
Check if an element has a specific class name:
```python
>>> article.has_class('product')
True
```
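A check like this typically means token membership in the space-separated `class` attribute rather than substring matching. Roughly, in plain Python (an illustration, not Scrapling's code):

```python
# An element's class attribute is a space-separated list of tokens
attrs = {"class": "product featured", "data-id": "1"}

def has_class(attrs, name):
    # split on whitespace so "prod" doesn't match "product"
    return name in attrs.get("class", "").split()

print(has_class(attrs, "product"))  # True
print(has_class(attrs, "prod"))     # False
```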
Iterate over all ancestors of an element:
```python
for ancestor in article.iterancestors():
    ...  # do something with each ancestor
```
Search for a specific ancestor that satisfies a search function. Pass a function that takes a [Selector](#selector) object as an argument and returns `True`/`False`:
```python
>>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>

>>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list'))  # Same result, different approach
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
```
## Selectors
The `Selectors` class is the "list" version of the [Selector](#selector) class. It inherits from the standard Python `list` type, so it shares all `list` properties and methods while adding more methods that make operations on the [Selector](#selector) instances within it more straightforward.

In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.

Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes and attribute values (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties return empty/default values gracefully.

```python
>>> page.css('a::text')           # -> Selectors (of text node Selectors)
>>> page.xpath('//a/text()')      # -> Selectors
>>> page.css('a::text').get()     # -> TextHandler (the first text value)
>>> page.css('a::text').getall()  # -> TextHandlers (all text values)
>>> page.css('a::attr(href)')     # -> Selectors
>>> page.xpath('//a/@href')       # -> Selectors
>>> page.css('.price_color')      # -> Selectors
```

### Data extraction methods
Starting with v0.4, [Selector](#selector) and [Selectors](#selectors) both provide `get()`, `getall()`, and their aliases `extract_first` and `extract` (following Scrapy conventions). The old `get_all()` method has been removed.

**On a [Selector](#selector) object:**

- `get()` returns a `TextHandler`: for text node selectors, it returns the text value; for HTML element selectors, it returns the serialized outer HTML.
- `getall()` returns a `TextHandlers` list containing the single serialized string.
- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.

```python
>>> page.css('h3')[0].get()  # Outer HTML of the element
'<h3>Product 1</h3>'

>>> page.css('h3::text')[0].get()  # Text value of the text node
'Product 1'
```

**On a [Selectors](#selectors) object:**

- `get(default=None)` returns the serialized string of the **first** element, or `default` if the list is empty.
- `getall()` serializes **all** elements and returns a `TextHandlers` list.
- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.

```python
>>> page.css('.price::text').get()  # First price text
'$10.99'

>>> page.css('.price::text').getall()  # All price texts
['$10.99', '$20.99', '$15.99']

>>> page.css('.price::text').get('')  # With default value
'$10.99'
```

These methods work seamlessly with all selection types (CSS, XPath, `find`, etc.) and are the recommended way to extract text and attribute values in a Scrapy-compatible style.

### Properties
Apart from the standard operations on Python lists (iteration, slicing, etc.), the following operations are available:

CSS and XPath selectors can be executed directly on the [Selector](#selector) instances, with the same return types as [Selector](#selector)'s `css` and `xpath` methods. The arguments are similar, except the `adaptive` argument is not available. This makes chaining methods straightforward:
```python
>>> page.css('.product_pod a')
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 ...]

>>> page.css('.product_pod').css('a')  # Returns the same result
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 ...]
```
The `re` and `re_first` methods can be run directly, taking the same arguments as in the [Selector](#selector) class. Here, `re_first` runs `re` on each [Selector](#selector) within and returns the first one with a result. The `re` method returns a [TextHandlers](#texthandlers) object combining all matches:
```python
>>> page.css('.price_color').re(r'[\d\.]+')
['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 ...]

>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
['a-light-in-the-attic_1000',
 'tipping-the-velvet_999',
 'soumission_998',
 'sharp-objects_997',
 ...]
```
The `search` method searches the available [Selector](#selector) instances. The function passed must accept a [Selector](#selector) instance as the first argument and return `True`/`False`. It returns the first matching [Selector](#selector) instance, or `None`:
```python
# Find the first product with price 54.23
>>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
>>> page.css('.product_pod').search(search_function)
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
```
The `filter` method takes a function like `search` but returns a `Selectors` instance of all matching [Selector](#selector) instances:
```python
# Find all products with prices over $50
>>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
>>> page.css('.product_pod').filter(filtering_function)
[<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 ...]
```
Safe access to the first or last element without index errors:
```python
>>> page.css('.product').first  # First Selector or None
<data='<article class="product" data-id="1"><h3...'>
>>> page.css('.product').last  # Last Selector or None
<data='<article class="product" data-id="3"><h3...'>
>>> page.css('.nonexistent').first  # Returns None instead of raising IndexError
```

Get the number of [Selector](#selector) instances in a [Selectors](#selectors) instance:
```python
page.css('.product_pod').length
```
which is equivalent to
```python
len(page.css('.product_pod'))
```

## TextHandler
All methods/properties that return a string return `TextHandler`, and those that return a list of strings return [TextHandlers](#texthandlers) instead.

TextHandler is a subclass of the standard Python string, so all standard string operations are supported.

TextHandler provides extra methods and properties beyond standard Python strings. All methods and properties in all classes that return string(s) return TextHandler, enabling chaining and cleaner code. It can also be imported directly and used on any string.
### Usage
All operations (slicing, indexing, etc.) and methods (`split`, `replace`, `strip`, etc.) return a `TextHandler`, so they can be chained.

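This chaining behavior comes from a common pattern: a `str` subclass whose methods return the subclass type. A minimal sketch (the `ChainText` name and methods are hypothetical, not Scrapling's actual implementation):

```python
class ChainText(str):
    """A str subclass whose operations return the subclass,
    so custom helpers stay available after every call."""

    def clean(self):
        # collapse runs of whitespace into single spaces and strip the ends
        return ChainText(" ".join(self.split()))

    def replace(self, old, new, count=-1):
        # re-wrap the plain-str result so chaining keeps working
        return ChainText(str.replace(self, old, new, count))


s = ChainText("  hello   world  ")
result = s.clean().replace("world", "there")
print(result)              # hello there
print(type(result).__name__)  # ChainText
```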
The `re` and `re_first` methods exist in [Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers) as well, accepting the same arguments.

- The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments but returns only the first result as a `TextHandler` instance.

It also takes other helpful arguments:

- **replace_entities**: Enabled by default. It replaces character entity references with their corresponding characters.
- **clean_match**: Disabled by default. When enabled, the text is cleaned of extra whitespace (including consecutive spaces) before matching.
- **case_sensitive**: Enabled by default. As the name implies, disabling it causes the regex to ignore letter case during compilation.

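The last two flags map roughly onto familiar standard-library `re` behavior (an illustration with plain `re`, not Scrapling's internals):

```python
import re

text = "Oh,   Hi   Mark"
pattern = "oh, hi mark"

# case_sensitive=False is roughly re.IGNORECASE; the runs of
# spaces still differ from the pattern, so there is no match yet
assert re.search(pattern, text, re.IGNORECASE) is None

# clean_match=True roughly collapses whitespace before matching
cleaned = " ".join(text.split())
match = re.search(pattern, cleaned, re.IGNORECASE)
print(match.group(0))  # Oh, Hi Mark
```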
The return result is [TextHandlers](#texthandlers) because the `re` method is used:
```python
>>> page.css('.price_color').re(r'[\d\.]+')
['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 ...]

>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
['a-light-in-the-attic_1000',
 'tipping-the-velvet_999',
 'soumission_998',
 'sharp-objects_997',
 ...]
```
Examples with custom strings demonstrating the other arguments:
```python
>>> from scrapling import TextHandler
>>> test_string = TextHandler('hi  there')  # Note the two spaces
>>> test_string.re('hi there')
>>> test_string.re('hi there', clean_match=True)  # `clean_match` cleans the string before matching
['hi there']

>>> test_string2 = TextHandler('Oh, Hi Mark')
>>> test_string2.re_first('oh, hi Mark')
>>> test_string2.re_first('oh, hi Mark', case_sensitive=False)  # Disabling `case_sensitive` makes it match
'Oh, Hi Mark'

# Mixing arguments
>>> test_string.re('hi there', clean_match=True, case_sensitive=False)
['hi there']
```
Since `html_content` returns `TextHandler`, regex can be applied directly on HTML content:
```python
>>> page.html_content.re('div class=".*">(.*)</div')
['In stock: 5', 'In stock: 3', 'Out of stock']
```

- The `.json()` method converts the content to a JSON object if possible; otherwise, it throws an error:
```python
>>> page.css('#page-data::text').get()
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
>>> page.css('#page-data::text').get().json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
If no text node is specified while selecting an element, the text content is selected automatically:
```python
>>> page.css('#page-data')[0].json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
The [Selector](#selector) class adds additional behavior. Given this page:
```html
<html>
  <body>
    <div>
      <script id="page-data" type="application/json">
        {
          "lastUpdated": "2024-09-22T10:30:00Z",
          "totalProducts": 3
        }
      </script>
    </div>
  </body>
</html>
```
The [Selector](#selector) class also has the `get_all_text` method, which returns a `TextHandler`. For example:
```python
>>> page.css('div::text').get().json()
```
This throws an error because the `div` tag has no direct text content. The `get_all_text` method handles this case:
```python
>>> page.css('div')[0].get_all_text(ignore_tags=[]).json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
The `ignore_tags` argument is needed here because its default value, `('script', 'style',)`, would exclude the `script` element's content.

When dealing with a JSON response:
```python
>>> page = Selector("""{"some_key": "some_value"}""")
```
The [Selector](#selector) class is optimized for HTML, so it treats this as a broken HTML response and wraps it. The `html_content` property shows:
```python
>>> page.html_content
'<html><body><p>{"some_key": "some_value"}</p></body></html>'
```
The `json` method can be used directly:
```python
>>> page.json()
{'some_key': 'some_value'}
```
For JSON responses, the [Selector](#selector) class keeps a raw copy of the content it receives. When `.json()` is called, it checks for that raw copy first and converts it to JSON. If the raw copy is unavailable (as with sub-elements), it checks the current element's text content, then falls back to `get_all_text`.

- The `.clean()` method removes stray whitespace characters and collapses consecutive spaces, returning a new `TextHandler` instance:
```python
>>> TextHandler('\n wonderful idea, \reh?').clean()
'wonderful idea, eh?'
```
The `remove_entities` argument causes `clean` to replace HTML entities with their corresponding characters.

- The `.sort()` method sorts the string's characters:
```python
>>> TextHandler('acb').sort()
'abc'
```
Or do it in reverse:
```python
>>> TextHandler('acb').sort(reverse=True)
'cba'
```

This class is returned in place of strings nearly everywhere in the library.

## TextHandlers
This class inherits from the standard Python list, adding `re` and `re_first` as new methods.

The `re_first` method runs `re` on each [TextHandler](#texthandler) and returns the first result, or `None`.

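The semantics resemble this plain-Python helper (a sketch of the behavior, not Scrapling's code):

```python
import re

def re_first(strings, pattern):
    # run the regex over each string and return the first capture found
    for s in strings:
        match = re.search(pattern, s)
        if match:
            return match.group(1)
    return None  # no string matched

values = ["no digits here", "price: 42", "price: 7"]
print(re_first(values, r"price: (\d+)"))  # 42
```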
## AttributesHandler
This is a read-only version of Python's standard dictionary (`dict`), used solely to store the attributes of each element/[Selector](#selector) instance.
```python
>>> print(page.find('script').attrib)
{'id': 'page-data', 'type': 'application/json'}
>>> type(page.find('script').attrib).__name__
'AttributesHandler'
```
Because it's read-only, it uses fewer resources than the standard dictionary. Still, it has the same dictionary methods and properties, except those that would let you modify or override the data.

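The read-only behavior is easy to picture with the standard library's `MappingProxyType` (an analogy only; `AttributesHandler` is Scrapling's own class):

```python
from types import MappingProxyType

# A read-only view over a plain dict: lookups work, mutation raises
attrs = MappingProxyType({"id": "page-data", "type": "application/json"})

print(attrs["id"])      # page-data
print("type" in attrs)  # True

try:
    attrs["id"] = "other"
except TypeError:
    print("read-only")  # mutation raises TypeError
```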
It currently adds two extra simple methods:

- The `search_values` method

Searches the current attributes by values (rather than keys) and returns a dictionary for each matching item.

A simple example:
```python
>>> for i in page.find('script').attrib.search_values('page-data'):
...     print(i)
{'id': 'page-data'}
```
The method also provides a `partial` argument, which allows you to search by part of the value:
```python
>>> for i in page.find('script').attrib.search_values('page', partial=True):
...     print(i)
{'id': 'page-data'}
```
A more practical example is combining it with `find_all` to find all elements that have a specific value in their attributes:
```python
>>> page.find_all(lambda element: list(element.attrib.search_values('product')))
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
All these elements have 'product' as the value of their `class` attribute.

The `list` call is needed because `search_values` returns a generator, and a generator object is always truthy; without converting it to a list, the condition would be `True` for every element.

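That truthiness pitfall is plain Python behavior and easy to verify:

```python
def empty_gen():
    # a generator that yields nothing at all
    yield from ()

g = empty_gen()
print(bool(g))   # True: a generator object is always truthy...
print(list(g))   # []  ...even when it yields nothing
```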
- The `json_string` property

This property converts the current attributes to a JSON string if the attributes are JSON-serializable; otherwise, it throws an error.

```python
>>> page.find('script').attrib.json_string
b'{"id":"page-data","type":"application/json"}'
```