ultimate-pi 0.1.0
- package/.agents/skills/caveman/SKILL.md +67 -0
- package/.agents/skills/compress/SKILL.md +111 -0
- package/.agents/skills/compress/scripts/__init__.py +9 -0
- package/.agents/skills/compress/scripts/__main__.py +3 -0
- package/.agents/skills/compress/scripts/benchmark.py +78 -0
- package/.agents/skills/compress/scripts/cli.py +73 -0
- package/.agents/skills/compress/scripts/compress.py +227 -0
- package/.agents/skills/compress/scripts/detect.py +121 -0
- package/.agents/skills/compress/scripts/validate.py +189 -0
- package/.agents/skills/context7-cli/SKILL.md +73 -0
- package/.agents/skills/context7-cli/references/docs.md +121 -0
- package/.agents/skills/context7-cli/references/setup.md +43 -0
- package/.agents/skills/context7-cli/references/skills.md +118 -0
- package/.agents/skills/emil-design-eng/SKILL.md +679 -0
- package/.agents/skills/lean-ctx/SKILL.md +149 -0
- package/.agents/skills/lean-ctx/scripts/install.sh +95 -0
- package/.agents/skills/scrapling-official/LICENSE.txt +28 -0
- package/.agents/skills/scrapling-official/SKILL.md +390 -0
- package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +26 -0
- package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +26 -0
- package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +26 -0
- package/.agents/skills/scrapling-official/examples/04_spider.py +58 -0
- package/.agents/skills/scrapling-official/examples/README.md +45 -0
- package/.agents/skills/scrapling-official/references/fetching/choosing.md +78 -0
- package/.agents/skills/scrapling-official/references/fetching/dynamic.md +352 -0
- package/.agents/skills/scrapling-official/references/fetching/static.md +432 -0
- package/.agents/skills/scrapling-official/references/fetching/stealthy.md +255 -0
- package/.agents/skills/scrapling-official/references/mcp-server.md +214 -0
- package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +86 -0
- package/.agents/skills/scrapling-official/references/parsing/adaptive.md +212 -0
- package/.agents/skills/scrapling-official/references/parsing/main_classes.md +586 -0
- package/.agents/skills/scrapling-official/references/parsing/selection.md +494 -0
- package/.agents/skills/scrapling-official/references/spiders/advanced.md +344 -0
- package/.agents/skills/scrapling-official/references/spiders/architecture.md +94 -0
- package/.agents/skills/scrapling-official/references/spiders/getting-started.md +164 -0
- package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +235 -0
- package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +196 -0
- package/.agents/skills/scrapling-official/references/spiders/sessions.md +205 -0
- package/.github/banner.png +0 -0
- package/.pi/SYSTEM.md +40 -0
- package/.pi/settings.json +5 -0
- package/PLAN.md +11 -0
- package/README.md +58 -0
- package/extensions/lean-ctx-enforce.ts +166 -0
- package/package.json +17 -0
- package/skills-lock.json +35 -0
- package/wiki/README.md +10 -0
- package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +25 -0
- package/wiki/decisions/0002-add-project-banner-to-readme.md +26 -0
- package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +26 -0
- package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +26 -0
@@ -0,0 +1,212 @@
# Adaptive scraping

Adaptive scraping (previously known as automatch) is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.

Consider a page with a structure like this:
```html
<div class="container">
    <section class="products">
        <article class="product" id="p1">
            <h3>Product 1</h3>
            <p class="description">Description 1</p>
        </article>
        <article class="product" id="p2">
            <h3>Product 2</h3>
            <p class="description">Description 2</p>
        </article>
    </section>
</div>
```
To scrape the first product (the one with the `p1` ID), a selector like this would be used:
```python
page.css('#p1')
```
When website owners implement structural changes like the following:
```html
<div class="new-container">
    <div class="product-wrapper">
        <section class="products">
            <article class="product new-class" data-id="p1">
                <div class="product-info">
                    <h3>Product 1</h3>
                    <p class="new-description">Description 1</p>
                </div>
            </article>
            <article class="product new-class" data-id="p2">
                <div class="product-info">
                    <h3>Product 2</h3>
                    <p class="new-description">Description 2</p>
                </div>
            </article>
        </section>
    </div>
</div>
```
The selector will no longer function, and your code needs maintenance. That's where Scrapling's `adaptive` feature comes into play.

With Scrapling, you enable the `adaptive` feature the first time you select an element so that its unique properties are saved. The next time you select that element and it no longer exists, Scrapling searches the page for the element with the highest similarity to the saved one.

```python
from scrapling import Selector, Fetcher
# Before the change
page = Selector(page_source, adaptive=True, url='example.com')
# or
Fetcher.adaptive = True
page = Fetcher.get('https://example.com')
# then
element = page.css('#p1', auto_save=True)
if not element:  # One day the website changes?
    element = page.css('#p1', adaptive=True)  # Scrapling still finds it!
# the rest of your code...
```
It works with all selection methods, not just CSS/XPath selection.

## Real-World Scenario

This example uses [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/) to demonstrate adaptive scraping across different versions of a website. A copy of [StackOverflow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/) is compared against the current design to show that the adaptive feature can extract the same button using the same selector.

To extract the Questions button from the old design, a selector like `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a` can be used (this specific selector was generated by Chrome).

Testing the same selector in both versions:
```python
>>> from scrapling import Fetcher
>>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>>> new_url = "https://stackoverflow.com/"
>>> Fetcher.configure(adaptive=True, adaptive_domain='stackoverflow.com')
>>>
>>> page = Fetcher.get(old_url, timeout=30)
>>> element1 = page.css(selector, auto_save=True)[0]
>>>
>>> # Same selector but used on the updated website
>>> page = Fetcher.get(new_url)
>>> element2 = page.css(selector, adaptive=True)[0]
>>>
>>> if element1.text == element2.text:
...     print('Scrapling found the same element in the old and new designs!')
Scrapling found the same element in the old and new designs!
```
The `adaptive_domain` argument is used here because Scrapling sees `archive.org` and `stackoverflow.com` as two different domains and would isolate their `adaptive` data. Passing `adaptive_domain` tells Scrapling to treat them as the same website for adaptive data storage.

In a typical scenario with the same URL for both requests, the `adaptive_domain` argument is not needed. The adaptive logic works the same way with both the `Selector` and `Fetcher` classes.

**Note:** The main reason for creating the `adaptive_domain` argument was to handle websites that change their URL along with their design/structure. In that case, it can be used to keep using the previously stored adaptive data with the new URL; otherwise, Scrapling would consider it a new website and discard the old data.

## How the adaptive scraping feature works

Adaptive scraping works in two phases:

1. **Save Phase**: Store unique properties of elements
2. **Match Phase**: Find elements with similar properties later

After selecting an element through any method, the library can find it the next time the website is scraped, even if it undergoes structural/design changes.

The general logic is as follows:

1. Scrapling saves that element's unique properties (listed below).
2. Scrapling stores those properties in its configured database (SQLite by default).
3. Because everything about the element can be changed or removed by the website's owner(s), nothing from the element itself can serve as a unique identifier in the database. The storage system therefore relies on two things:
    1. The domain of the current website. When using the `Selector` class, pass it when initializing; when using a fetcher, the domain is automatically taken from the URL.
    2. An `identifier` to query that element's properties from the database. The identifier does not always need to be set manually (see below).

    Together, they are later used to retrieve the element's unique properties from the database.

4. Later, when the website's structure changes, enabling `adaptive` causes Scrapling to retrieve the element's unique properties and match all elements on the page against them. A score is calculated based on their similarity to the desired element; everything is taken into consideration in that comparison.
5. The element(s) with the highest similarity score to the wanted element are returned.

### The unique properties

The unique properties Scrapling relies on are:

- Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
- Element's parent tag name, attributes (names and values), and text.

The comparison between elements is not exact; it is based on how similar these values are. Everything is considered, including the values' order (e.g., the order in which class names are written).
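
As an illustration of the match phase, here is a toy similarity scorer over simple property dictionaries. This is not Scrapling's actual algorithm or data model; the property names and weights here are invented for the example:

```python
def similarity(saved: dict, candidate: dict) -> float:
    """Toy similarity score between a saved element's properties and a candidate."""
    score = 0.0
    # Same tag name is a strong signal
    if saved["tag"] == candidate["tag"]:
        score += 1.0
    # Each shared attribute name/value pair adds a little
    shared = set(saved["attributes"].items()) & set(candidate["attributes"].items())
    score += 0.5 * len(shared)
    # Matching text adds a lot
    if saved["text"] and saved["text"] == candidate["text"]:
        score += 2.0
    return score

def relocate(saved: dict, candidates: list) -> dict:
    """Return the candidate most similar to the saved element."""
    return max(candidates, key=lambda c: similarity(saved, c))

# The old element vs. the candidates from the redesigned page above
saved = {"tag": "article", "attributes": {"id": "p1", "class": "product"}, "text": "Product 1"}
candidates = [
    {"tag": "article", "attributes": {"data-id": "p1", "class": "product new-class"}, "text": "Product 1"},
    {"tag": "article", "attributes": {"data-id": "p2", "class": "product new-class"}, "text": "Product 2"},
]
best = relocate(saved, candidates)  # picks the first candidate: same tag and text
```

In the real feature, far more is compared (siblings, path, parent, attribute order), but the principle is the same: every candidate on the page is scored and the closest match wins.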

## How to use the adaptive feature

The adaptive feature can be applied to any found element and is exposed as arguments on the CSS/XPath selection methods.

First, enable the `adaptive` feature by passing `adaptive=True` to the [Selector](main_classes.md#selector) class when initializing it, or enable it on the fetcher being used.

Examples:
```python
>>> from scrapling import Selector, Fetcher
>>> page = Selector(html_doc, adaptive=True)
# OR
>>> Fetcher.adaptive = True
>>> page = Fetcher.get('https://example.com')
```
When using the [Selector](main_classes.md#selector) class, pass the URL of the website with the `url` argument so Scrapling can separate the properties saved for each element by domain.

If no URL is passed, the word `default` is used in place of the URL field while saving the element's unique properties. This is only an issue when using the same identifier for a different website without passing the URL parameter, because the save process overwrites previous data and the `adaptive` feature uses only the latest saved properties.

The `storage` and `storage_args` arguments control the database connection; by default, the SQLite class provided by the library is used.
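
The storage semantics described above, keyed by domain and identifier, with `default` standing in for a missing URL and each save overwriting earlier data, can be sketched as a toy in-memory store. This is only a model of the described behavior, not the library's SQLite implementation:

```python
class ToyAdaptiveStorage:
    """Toy in-memory model of the (domain, identifier) -> properties store."""

    def __init__(self):
        self._db = {}

    def save(self, properties: dict, identifier: str, domain: str = None):
        # No URL/domain given -> the word 'default' is used in its place.
        # Saving always overwrites any previous data for the same key.
        self._db[(domain or "default", identifier)] = properties

    def retrieve(self, identifier: str, domain: str = None):
        return self._db.get((domain or "default", identifier))

store = ToyAdaptiveStorage()
store.save({"tag": "article"}, identifier="#p1", domain="example.com")
store.save({"tag": "div"}, identifier="#p1")  # no domain -> stored under 'default'
```

This is why reusing an identifier across websites without passing the URL is the one case that causes collisions: both saves land under the `default` domain and the later one wins.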

There are two main ways to use the `adaptive` feature:

### The CSS/XPath Selection way

First, use the `auto_save` argument while selecting an element that exists on the page:
```python
element = page.css('#p1', auto_save=True)
```
When the element no longer exists, use the same selector with the `adaptive` argument to have the library find it:
```python
element = page.css('#p1', adaptive=True)
```
With the `css`/`xpath` methods, the identifier is set automatically to the selector string passed to the method.

Additionally, all these methods accept an `identifier` argument so you can set it yourself, which is useful in some cases; it can also be combined with `auto_save` to save properties under that identifier.
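
The two steps can be combined into a small fallback helper. A minimal sketch; the helper name `css_with_fallback` is ours, and it assumes a `page` object whose `css` method accepts the `auto_save` and `adaptive` keyword arguments shown above:

```python
def css_with_fallback(page, selector: str):
    """Select with auto_save; if nothing matches, retry with adaptive matching."""
    elements = page.css(selector, auto_save=True)
    if not elements:
        # The properties previously saved under this selector are used
        # to relocate the element on the changed page.
        elements = page.css(selector, adaptive=True)
    return elements
```

Calling with `auto_save=True` on every run keeps the stored properties fresh whenever the selector still matches, since each save overwrites the previous data.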

### The manual way

Elements can be manually saved, retrieved, and relocated with the `adaptive` feature. This allows relocating any element found by any method.

Example of getting an element by text:
```python
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
```
Save its unique properties using the `save` method. Here the identifier must be set manually (use a meaningful one):
```python
>>> page.save(element, 'my_special_element')
```
Later, retrieve and relocate the element inside the page with the `retrieve` and `relocate` methods:
```python
>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, selector_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, selector_type=True).css('::text').getall()
['Tipping the Velvet']
```

To keep the result as an `lxml.etree` element object instead, omit the `selector_type` argument:
```python
>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]
```

## Troubleshooting

### No Matches Found
```python
# 1. Check if data was saved
element_data = page.retrieve('identifier')
if not element_data:
    print("No data saved for this identifier")

# 2. Try with a different identifier
products = page.css('.product', adaptive=True, identifier='old_selector')

# 3. Save again with a new identifier
products = page.css('.new-product', auto_save=True, identifier='new_identifier')
```

### Wrong Elements Matched
```python
# Use more specific selectors
products = page.css('.product-list .product', auto_save=True)

# Or save with more context
product = page.find_by_text('Product Name').parent
page.save(product, 'specific_product')
```

## Known Issues

In the `adaptive` save process, only the unique properties of the first element in the selection results are saved. So if your selector matches different elements in other locations on the page, `adaptive` will later relocate only that first element. This does not apply to combined CSS selectors (selectors joined with commas, for example), because those are split apart and each part is executed alone.
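
The comma-splitting behavior can be pictured with a toy sketch; again, this is only a model of the described behavior, not Scrapling's code:

```python
def save_first_matches(page, combined_selector: str) -> dict:
    """Toy model: split a combined CSS selector on commas and keep only
    the first match of each sub-selector (illustration only)."""
    saved = {}
    for part in (s.strip() for s in combined_selector.split(',')):
        matches = page.css(part)
        if matches:
            # Only the first element's properties are kept per sub-selector
            saved[part] = matches[0]
    return saved
```

Each sub-selector thus gets its own saved entry, while a non-combined selector that matches several elements only ever contributes its first result.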
|
|
212
|
+
|