scrapling 0.1.2__py3-none-any.whl → 0.2__py3-none-any.whl

Metadata-Version: 2.1
Name: scrapling
Version: 0.1.2
Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
Home-page: https://github.com/D4Vinci/Scrapling
Author: Karim Shoair
Author-email: karim.shoair@pm.me
License: BSD
Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
Project-URL: Source, https://github.com/D4Vinci/Scrapling
Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Typing :: Typed
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests >=2.3
Requires-Dist: lxml >=4.5
Requires-Dist: cssselect >=1.2
Requires-Dist: w3lib
Requires-Dist: orjson >=3
Requires-Dist: tldextract
# 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
[![Tests](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg)](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [![PyPI version](https://badge.fury.io/py/Scrapling.svg)](https://badge.fury.io/py/Scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![PyPI Downloads](https://static.pepy.tech/badge/scrapling)](https://pepy.tech/project/scrapling)

Dealing with failing web scrapers due to website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. Whether you're a beginner or an expert, Scrapling provides powerful features while maintaining simplicity.

```python
from scrapling import Adaptor

# Scrape data that survives website changes
page = Adaptor(html, auto_match=True)
products = page.css('.product', auto_save=True)
# Later, even if selectors change:
products = page.css('.product', auto_match=True)  # Still finds them!
```

## Key Features

### Adaptive Scraping
- 🔄 **Smart Element Tracking**: Locate previously identified elements after website structure changes, using an intelligent similarity system and integrated storage.
- 🎯 **Flexible Querying**: Use CSS selectors, XPath, text search, or regex - chain them however you want!
- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you found on the page (e.g., other products like the product you found).
- 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors using its powerful features.

### Performance
- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries (outperforming BeautifulSoup by up to 237x in our tests).
- 🔋 **Memory Efficient**: Optimized data structures for a minimal memory footprint.
- ⚡ **Fast JSON serialization**: 10x faster JSON serialization than the standard json library, with more options.

### Developer Experience
- 🛠️ **Powerful Navigation API**: Traverse the DOM tree easily in all directions and get the info you want (parent, ancestors, siblings, children, next/previous element, and more).
- 🧬 **Rich Text Processing**: All strings have built-in methods for regex matching, cleaning, and more. All elements' attributes are read-only dictionaries that are faster than standard dictionaries, with added methods.
- 📝 **Automatic Selector Generation**: Create robust CSS/XPath selectors for any element.
- 🔌 **Scrapy-Compatible API**: Familiar methods and similar pseudo-elements for Scrapy users.
- 📘 **Type hints**: Complete type coverage for better IDE support and fewer bugs.

## Getting Started

Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

```python
import requests
from scrapling import Adaptor

# Fetch a web page
url = 'https://quotes.toscrape.com/'
response = requests.get(url)

# Create an Adaptor instance
page = Adaptor(response.text, url=url)
# Get all strings in the full page
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes; any of these methods will return a list of strings (TextHandlers)
quotes = page.css('.quote .text::text')  # CSS selector
quotes = page.xpath('//span[@class="text"]/text()')  # XPath
quotes = page.css('.quote').css('.text::text')  # Chained selectors
quotes = [element.text for element in page.css('.quote').css('.text')]  # Slower than the bulk query above

# Get the first quote element
quote = page.css('.quote').first  # or [0] or .get()

# Working with elements
quote.html_content  # Inner HTML
quote.prettify()  # Prettified version of the inner HTML
quote.attrib  # Element attributes
quote.path  # DOM path to the element (list)
```
To keep it simple, all methods can be chained on top of each other as long as you are chaining methods that return an element (an `Adaptor` object) or a list of elements (an `Adaptors` object).
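For instance, here's a quick sketch (using the same `page` as above) of how chaining moves between the two types:
```python
# `css` on the page returns an `Adaptors` list; `.first` drops back to a
# single `Adaptor`, whose own `css` can then be chained further.
first_quote = page.css('.quote').first       # Adaptors -> Adaptor
author = first_quote.css('.author::text')    # query within that element
```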

### Installation
Scrapling is a breeze to get started with: it requires only Python 3.7 or newer, and the rest of the requirements are installed automatically with the package.
```bash
# Using pip
pip install scrapling

# Or the latest from GitHub
pip install git+https://github.com/D4Vinci/Scrapling.git@master
```

## Performance

Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds, all while focusing exclusively on parsing HTML documents.
Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

### Text Extraction Speed Test (5000 nested elements)

| # | Library | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling | 5.44 | 1.0x |
| 2 | Parsel/Scrapy | 5.53 | 1.017x |
| 3 | Raw Lxml | 6.76 | 1.243x |
| 4 | PyQuery | 21.96 | 4.037x |
| 5 | Selectolax | 67.12 | 12.338x |
| 6 | BS4 with Lxml | 1307.03 | 240.263x |
| 7 | MechanicalSoup | 1322.64 | 243.132x |
| 8 | BS4 with html5lib | 3373.75 | 620.175x |

As you can see, Scrapling is on par with Parsel/Scrapy and slightly faster than raw lxml, the library both of them are built on top of; these are the closest results to Scrapling. PyQuery is also built on top of lxml, yet Scrapling is still about 4 times faster.

### Extraction By Text Speed Test

| Library | Time (ms) | vs Scrapling |
|:-----------:|:---------:|:------------:|
| Scrapling | 2.51 | 1.0x |
| AutoScraper | 11.41 | 4.546x |

Scrapling can find elements with more methods, and it returns full `Adaptor` element objects, not just the text like AutoScraper. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them. As you can see, Scrapling is still 4.5 times faster at the same task.
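For reference, the Scrapling side of that comparison boils down to the three steps below (a sketch, assuming the test page is already loaded into `page`):
```python
# Find one element by its text, locate the elements similar to it, then pull
# the text content of all of them - the same steps AutoScraper performs.
element = page.find_by_text('Tipping the Velvet', first_match=True)
texts = [el.text for el in element.find_similar()]
```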

> All benchmarks' results are an average of 100 runs. See our [benchmarks.py](/benchmarks.py) for methodology and to run your comparisons.
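For a quick, informal version of the first test, a minimal sketch could look like the following (the generated document and run count here are illustrative, not the exact benchmark setup):
```python
import timeit

from scrapling import Adaptor

# Build a deeply nested document, then time bulk text extraction over it.
html = '<html><body>' + '<div><p>text</p>' * 5000 + '</div>' * 5000 + '</body></html>'

def extract():
    return Adaptor(html).get_all_text(ignore_tags=('script', 'style'))

print(f"{timeit.timeit(extract, number=100) / 100 * 1000:.2f} ms per run")
```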

## Advanced Features
### Smart Navigation
```python
>>> quote.tag
'div'

>>> quote.parent
<data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>

>>> quote.parent.tag
'div'

>>> quote.children
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]

>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

>>> quote.next  # gets the next element; the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>

>>> quote.children.css(".author::text")
['Albert Einstein']

>>> quote.has_class('quote')
True

# Generate new selectors for any element
>>> quote.css_selector
'body > div > div:nth-of-type(2) > div > div'

# Test these selectors in your favorite browser or reuse them in other methods of the library!
>>> quote.xpath_selector
'//body/div/div[2]/div/div'
```
If your case needs more than the element's parent, you can iterate over the whole tree of any element's ancestors like below:
```python
for ancestor in quote.iterancestors():
    print(ancestor.tag)  # or do something else with each ancestor
```
You can also search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes an `Adaptor` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
```python
>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
<data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>
```

### Content-based Selection & Finding Similar Elements
You can select elements by their text content in multiple ways; here's a full example on another website:
```python
>>> response = requests.get('https://books.toscrape.com/index.html')

>>> page = Adaptor(response.text, url=response.url)

>>> page.find_by_text('Tipping the Velvet')  # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>

>>> page.find_by_text('Tipping the Velvet', first_match=False)  # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]

>>> page.find_by_regex(r'£[\d\.]+')  # Get the first element whose text content matches my price regex
<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>

>>> page.find_by_regex(r'£[\d\.]+', first_match=False)  # Get all elements that match my price regex
[<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
...]
```
Find all elements that are similar to the current element in location and attributes:
```python
# For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
<data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
...]

# You will notice that the number of elements is 19, not 20, because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19

# Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html',
'catalogue/soumission_998/index.html',
'catalogue/sharp-objects_997/index.html',
...]
```
To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point, for some reason:
```python
>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
...     print({
...         "name": product.css('h3 a::text')[0],
...         "price": product.css('.price_color')[0].re_first(r'[\d\.]+'),
...         "stock": product.css('.availability::text')[-1].clean()
...     })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...
```
The [documentation](/docs/Examples) will provide more advanced examples.

### Handling Structural Changes
> Because [the Internet Archive](https://web.archive.org/) is down at the time of writing this, I can't use real websites as examples, even though I have tested this before (browsing an old version of a website through the archive and then treating the current version of the website as the structural change).

Let's say you are scraping a page with a structure like this:
```html
<div class="container">
    <section class="products">
        <article class="product" id="p1">
            <h3>Product 1</h3>
            <p class="description">Description 1</p>
        </article>
        <article class="product" id="p2">
            <h3>Product 2</h3>
            <p class="description">Description 2</p>
        </article>
    </section>
</div>
```
and you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this:
```python
page.css('#p1')
```
When website owners implement structural changes like:
```html
<div class="new-container">
    <div class="product-wrapper">
        <section class="products">
            <article class="product new-class" data-id="p1">
                <div class="product-info">
                    <h3>Product 1</h3>
                    <p class="new-description">Description 1</p>
                </div>
            </article>
            <article class="product new-class" data-id="p2">
                <div class="product-info">
                    <h3>Product 2</h3>
                    <p class="new-description">Description 2</p>
                </div>
            </article>
        </section>
    </div>
</div>
```
the selector will no longer function and your code will need maintenance. That's where Scrapling's auto-matching feature comes into play.

```python
# Before the change
page = Adaptor(page_source, url='example.com', auto_match=True)
element = page.css('#p1', auto_save=True)
if not element:  # One day the website changes?
    element = page.css('#p1', auto_match=True)  # Still finds it!
# the rest of the code...
```
> How does the auto-matching work? Check the [FAQs](#faqs) section for that and other possible issues while auto-matching.

**Notes:**
1. Passing the `auto_save` argument without setting `auto_match` to `True` while initializing the Adaptor object will only result in the `auto_save` argument value being ignored, with the following warning message:
```text
Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.
```
This behavior is purely for performance reasons, so the database gets created/connected only when you are planning to use the auto-matching features. The same applies to the `auto_match` argument.

2. The `auto_match` parameter works only for `Adaptor` instances, not `Adaptors`, so if you do something like this you will get an error:
```python
page.css('body').css('#p1', auto_match=True)
```
because you can't auto-match a whole list; you have to be specific and do something like:
```python
page.css('body')[0].css('#p1', auto_match=True)
```

### Is That All?
Here's what else you can do with Scrapling:

- Accessing the `lxml.etree` object itself of any element directly:
```python
>>> quote._root
<Element div at 0x107f98870>
```
- Saving and retrieving elements manually to auto-match them outside the `css` and `xpath` methods, but you have to set the identifier yourself.

  - To save an element to the database:
```python
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
>>> page.save(element, 'my_special_element')
```
  - Later, when you want to retrieve it and relocate it in the page with auto-matching, it goes like this:
```python
>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, adaptor_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, adaptor_type=True).css('::text')
['Tipping the Velvet']
```
  - If you want to keep it as an `lxml.etree` object, leave out the `adaptor_type` argument:
```python
>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]
```

- Doing operations on element content is the same as in Scrapy:
```python
quote.re(r'something')  # Get all strings (TextHandlers) that match the regex pattern
quote.re_first(r'something')  # Get the first matching string (TextHandler) only
quote.json()  # If the content text is JSON-parsable, convert it to JSON using `orjson`, which is 10x faster than the standard json library and provides more options
```
All of these are actually methods of the `TextHandler` within the element that contains its text content, so the same operations can be done directly on the string returned by the `.text` property or an equivalent selector function (a sketch at the end of this list makes this concrete).

- Doing operations on the text content itself includes:
  - Cleaning the text of white spaces and replacing consecutive spaces with a single space:
```python
quote.clean()
```
  - You already know about regex matching and fast JSON parsing, but did you know that all strings returned from a regex search are `TextHandler` objects too? So in cases where you have, for example, a JS object assigned to a variable inside JavaScript code and you want to extract it with regex and then convert it to a JSON object, in other libraries this would take more than one line of code, but here it's one line:
```python
page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
```
  - Sorting all characters in the string as if it were a list and returning the new string:
```python
quote.sort()
```
> To be clear, `TextHandler` is a sub-class of Python's `str`, so all normal operations/methods that work with Python strings will work with it.

- Any element's attributes are not exactly a dictionary but a read-only sub-class of [mapping](https://docs.python.org/3/glossary.html#term-mapping) called `AttributesHandler`, so it's faster. The string values it returns are `TextHandler` objects, so all the operations above can be done on them, along with standard dictionary operations that don't modify the data, and more :)
  - Unlike standard dictionaries, here you can search by values too, and partial searches are possible. It might be handy in some cases (returns a generator of matches):
```python
>>> for item in element.attrib.search_values('catalogue', partial=True):
...     print(item)
{'href': 'catalogue/tipping-the-velvet_999/index.html'}
```
  - Serialize the current attributes to JSON bytes:
```python
>>> element.attrib.json_string
b'{"href":"catalogue/tipping-the-velvet_999/index.html","title":"Tipping the Velvet"}'
```
  - Convert it to a normal dictionary:
```python
>>> dict(element.attrib)
{'href': 'catalogue/tipping-the-velvet_999/index.html',
'title': 'Tipping the Velvet'}
```

Scrapling is under active development, so expect many more features coming soon :)

## More Advanced Usage

A lot of deep details are skipped here to keep this as short as possible, so to take a deep dive, head to the [docs](/docs) section. I will try to keep it as updated as possible and add complex examples. There I will explain points like how to write your own storage system, how to write spiders that don't depend on selectors at all, and more...

Note that implementing your own storage system can be complex, as there are some strict rules, such as inheriting from the same abstract class, following the singleton design pattern used in the other classes, and more. So make sure to read the docs first.
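Purely as an illustration of those two rules (the real abstract class and its required methods live in `scrapling/storage_adaptors.py`; every name and signature below is hypothetical):
```python
from abc import ABC, abstractmethod

class StorageSystem(ABC):  # hypothetical stand-in for Scrapling's abstract class
    @abstractmethod
    def save(self, element, identifier): ...

    @abstractmethod
    def retrieve(self, identifier): ...

class InMemoryStorage(StorageSystem):
    _instance = None  # singleton, as the docs require

    def __new__(cls):
        # Always hand back the same instance, creating it on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._data = {}
        return cls._instance

    def save(self, element, identifier):
        self._data[identifier] = element

    def retrieve(self, identifier):
        return self._data.get(identifier)
```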

## FAQs
This section addresses common questions about Scrapling; please read it before opening an issue.

### How does auto-matching work?
1. You need to get a working selector and run it at least once with the `css` or `xpath` methods with the `auto_save` parameter set to `True`, before structural changes happen.
2. Before returning results to you, Scrapling saves unique properties about that element to its configured database.
3. Because everything about the element can be changed or removed, nothing from the element itself can be used as a unique identifier in the database. To solve this issue, the storage system relies on two things:
   1. The domain of the URL you gave while initializing the first Adaptor object.
   2. The `identifier` parameter you passed to the method while selecting. If you didn't pass one, the selector string itself is used as the identifier, but remember you will have to use it as the identifier value later, when the structure changes and you want to pass the new selector.

   Together, both are used to retrieve the element's unique properties from the database later.
4. Later, when you enable the `auto_match` parameter for both the Adaptor instance and the method call, the element's properties are retrieved and Scrapling loops over all elements in the page, comparing each one's unique properties to the saved ones and calculating a similarity score for each candidate.
5. The comparison between elements is not exact; it's about how similar the values are, so everything is taken into consideration, even the values' order, like the order in which the element's class names were written before versus how they are written now.
6. The score for each element is stored in a table, and in the end, the element(s) with the highest combined similarity scores are returned (see the sketch after this list).
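Putting those steps together, the whole cycle looks roughly like this (page sources, URLs, and selectors here are placeholders):
```python
from scrapling import Adaptor

old_page_source = '<html>...</html>'  # placeholder: the page before the change
new_page_source = '<html>...</html>'  # placeholder: the page after the change

# First run, before any structural change: save the element's unique properties.
page = Adaptor(old_page_source, url='https://example.com', auto_match=True)
page.css('#prices', auto_save=True, identifier='prices_box')

# A later run, after the change: same identifier, auto-matched lookup.
page = Adaptor(new_page_source, url='https://example.com', auto_match=True)
elements = page.css('.new-prices', auto_match=True, identifier='prices_box')
```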

### How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?
Not a big problem, as it depends on your usage: the word `default` is used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website without passing the URL parameter while initializing it as well; the save process will overwrite the previous data, and auto-matching uses only the latest saved properties.

### If all things about an element can change or get removed, what are the unique properties to be saved?
For each element, Scrapling will extract:
- The element's tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
- The element's parent tag name, attributes (names and values), and text.

### I have enabled the `auto_save`/`auto_match` parameter while selecting and it got completely ignored with a warning message
That's because passing the `auto_save`/`auto_match` argument without setting `auto_match` to `True` while initializing the Adaptor object results in the argument value being ignored. This behavior is purely for performance reasons, so the database gets created only when you are planning to use the auto-matching features.

### I have done everything as in the docs but auto-matching didn't return anything, what's wrong?
It could be one of these reasons:
1. No data were saved/stored for this element before.
2. The selector passed is not the one used while storing the element's data. The solution is simple:
   - Pass the old selector again as an identifier to the method called (as shown in the sketch after this list).
   - Retrieve the element with the `retrieve` method using the old selector as the identifier, then save it again with the `save` method and the new selector as the identifier.
   - Start using the `identifier` argument more often if you are planning to keep changing selectors from now on.
3. The website underwent some extreme structural changes, like a whole new design. If this happens a lot with a website, the solution would be to make your code as selector-free as possible using Scrapling's features.
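For reason 2, the first fix looks like this in code (selector names are illustrative):
```python
# The element was originally saved with '#old-id' as its identifier,
# so pass that identifier explicitly when auto-matching with a new selector.
element = page.css('.new-selector', auto_match=True, identifier='#old-id')
```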

### Can Scrapling replace code built on top of BeautifulSoup4?
Pretty much, yeah. Almost all the features you get from BeautifulSoup can be found or achieved in Scrapling one way or another. In fact, if you see a feature in bs4 that is missing in Scrapling, please make a feature request from the issues tab to let me know.

### Can Scrapling replace code built on top of AutoScraper?
Of course: you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later, like the model feature in AutoScraper. I have pulled all the top articles about AutoScraper from Google and tested Scrapling against the examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.

### Is Scrapling thread-safe?
Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.
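So a pattern like this sketch is fine (the document contents are placeholders):
```python
from concurrent.futures import ThreadPoolExecutor

from scrapling import Adaptor

documents = ['<html><body><a href="/">home</a></body></html>'] * 4  # placeholders

def count_links(html):
    # Each thread builds its own independent Adaptor instance.
    return len(Adaptor(html).css('a'))

with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(count_links, documents)))
```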

## Sponsors
[![Capsolver Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/CapSolver.png)](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)

## Contributing
Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

Please read the [contributing file](/CONTRIBUTING.md) before doing anything.

## License
This work is licensed under BSD-3-Clause.

## Acknowledgments
This project includes code adapted from:
- Parsel (BSD License) - used for the [translator](/scrapling/translator.py) submodule

## Known Issues
- In the auto-matching save process, only the unique properties of the first element from the selection results are saved. So if the selector you are using matches different elements on the page in different locations, auto-matching will probably return only that first element when you relocate it later. This doesn't apply to combined CSS selectors (using commas to combine more than one selector, for example), as those selectors get separated and each one is executed alone.
- Currently, Scrapling is not compatible with async/await.

<div align="center"><small>Made with ❤️ by Karim Shoair</small></div><br>

scrapling-0.1.2.dist-info/RECORD:

scrapling/__init__.py,sha256=bxgmUv7rTGX8os8Spzxg4lDsNvVv1BWrHXQVDJu86r4,337
scrapling/custom_types.py,sha256=D4bAwpm3JIjfw4I2FS0odcq371OQAAUTkBN4ajPWX-4,6001
scrapling/mixins.py,sha256=m3ZvTyjt9JNtfZ4NYuaQsfi84UbFkatN_NEiLVZM7TY,2924
scrapling/parser.py,sha256=HNDq9OQzU3zpn31AfmreVlZv9kHielb8XoElSwLzK34,44650
scrapling/storage_adaptors.py,sha256=FEh0E8iss2ZDXAL1IDGxQVpEcdbLOupfmUwcsp_z-QY,6198
scrapling/translator.py,sha256=E0oueabsELxTGSv91q6AQgQnsN4-oQ76mq1u0jR2Ofo,5410
scrapling/utils.py,sha256=ApmNjCxxy-N_gAYRnzutLvPBvY_s9FTYp8UhdyeZXSc,5960
scrapling-0.1.2.dist-info/LICENSE,sha256=XHgu8DRuT7_g3Hb9Q18YGg8eShp6axPBacbnQxT_WWQ,1499
scrapling-0.1.2.dist-info/METADATA,sha256=-EErnYT0EABbPVgXR-eln3-XS-8haAlHDikNs3pZAKU,27357
scrapling-0.1.2.dist-info/WHEEL,sha256=OVMc5UfuAQiSplgO0_WdW7vXVGAt9Hdd6qtN4HotdyA,91
scrapling-0.1.2.dist-info/top_level.txt,sha256=Ud-yF-PC2U5HQ3nc5QwT7HSPdIpF1RuwQ_mYgBzHHIM,10
scrapling-0.1.2.dist-info/RECORD,,