scrapling 0.1.2__py3-none-any.whl → 0.2.1__py3-none-any.whl

Files changed (35)
  1. scrapling/__init__.py +4 -3
  2. scrapling/core/__init__.py +0 -0
  3. scrapling/core/_types.py +25 -0
  4. scrapling/{custom_types.py → core/custom_types.py} +48 -3
  5. scrapling/{mixins.py → core/mixins.py} +22 -7
  6. scrapling/{storage_adaptors.py → core/storage_adaptors.py} +2 -2
  7. scrapling/{translator.py → core/translator.py} +2 -12
  8. scrapling/{utils.py → core/utils.py} +14 -61
  9. scrapling/engines/__init__.py +7 -0
  10. scrapling/engines/camo.py +128 -0
  11. scrapling/engines/constants.py +108 -0
  12. scrapling/engines/pw.py +237 -0
  13. scrapling/engines/static.py +112 -0
  14. scrapling/engines/toolbelt/__init__.py +19 -0
  15. scrapling/engines/toolbelt/custom.py +154 -0
  16. scrapling/engines/toolbelt/fingerprints.py +81 -0
  17. scrapling/engines/toolbelt/navigation.py +108 -0
  18. scrapling/fetchers.py +198 -0
  19. scrapling/parser.py +223 -70
  20. scrapling/py.typed +1 -0
  21. scrapling-0.2.1.dist-info/METADATA +835 -0
  22. scrapling-0.2.1.dist-info/RECORD +33 -0
  23. {scrapling-0.1.2.dist-info → scrapling-0.2.1.dist-info}/WHEEL +1 -1
  24. {scrapling-0.1.2.dist-info → scrapling-0.2.1.dist-info}/top_level.txt +1 -0
  25. tests/__init__.py +1 -0
  26. tests/fetchers/__init__.py +1 -0
  27. tests/fetchers/test_camoufox.py +62 -0
  28. tests/fetchers/test_httpx.py +67 -0
  29. tests/fetchers/test_playwright.py +74 -0
  30. tests/parser/__init__.py +0 -0
  31. tests/parser/test_automatch.py +56 -0
  32. tests/parser/test_general.py +286 -0
  33. scrapling-0.1.2.dist-info/METADATA +0 -477
  34. scrapling-0.1.2.dist-info/RECORD +0 -12
  35. {scrapling-0.1.2.dist-info → scrapling-0.2.1.dist-info}/LICENSE +0 -0
@@ -0,0 +1,835 @@
1
+ Metadata-Version: 2.1
2
+ Name: scrapling
3
+ Version: 0.2.1
4
+ Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
5
+ Home-page: https://github.com/D4Vinci/Scrapling
6
+ Author: Karim Shoair
7
+ Author-email: karim.shoair@pm.me
8
+ License: BSD
9
+ Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
10
+ Project-URL: Source, https://github.com/D4Vinci/Scrapling
11
+ Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
12
+ Classifier: Operating System :: OS Independent
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: BSD License
16
+ Classifier: Natural Language :: English
17
+ Classifier: Topic :: Internet :: WWW/HTTP
18
+ Classifier: Topic :: Text Processing :: Markup
19
+ Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
20
+ Classifier: Topic :: Text Processing :: Markup :: HTML
21
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
22
+ Classifier: Programming Language :: Python :: 3
23
+ Classifier: Programming Language :: Python :: 3 :: Only
24
+ Classifier: Programming Language :: Python :: 3.8
25
+ Classifier: Programming Language :: Python :: 3.9
26
+ Classifier: Programming Language :: Python :: 3.10
27
+ Classifier: Programming Language :: Python :: 3.11
28
+ Classifier: Programming Language :: Python :: 3.12
29
+ Classifier: Programming Language :: Python :: 3.13
30
+ Classifier: Programming Language :: Python :: Implementation :: CPython
31
+ Classifier: Typing :: Typed
32
+ Requires-Python: >=3.8
33
+ Description-Content-Type: text/markdown
34
+ License-File: LICENSE
35
+ Requires-Dist: requests >=2.3
36
+ Requires-Dist: lxml >=4.5
37
+ Requires-Dist: cssselect >=1.2
38
+ Requires-Dist: w3lib
39
+ Requires-Dist: orjson >=3
40
+ Requires-Dist: tldextract
41
+ Requires-Dist: httpx[brotli,zstd]
42
+ Requires-Dist: playwright
43
+ Requires-Dist: rebrowser-playwright
44
+ Requires-Dist: camoufox >=0.3.9
45
+ Requires-Dist: browserforge
46
+
47
+ # 🕷️ Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
48
+ [![Tests](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg)](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [![PyPI version](https://badge.fury.io/py/Scrapling.svg)](https://badge.fury.io/py/Scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![PyPI Downloads](https://static.pepy.tech/badge/scrapling)](https://pepy.tech/project/scrapling)
49
+
50
+ Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.
51
+
52
+ Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
53
+
54
+ ```python
55
+ >> from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
56
+ # Fetch websites' source under the radar!
57
+ >> page = StealthyFetcher().fetch('https://example.com', headless=True, network_idle=True)
58
+ >> print(page.status)
59
+ 200
60
+ >> products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
61
+ >> # Later, if the website structure changes, pass `auto_match=True`
62
+ >> products = page.css('.product', auto_match=True) # and Scrapling still finds them!
63
+ ```
64
+
65
+ # Sponsors
66
+
67
+ [Evomi](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling) is your Swiss Quality Proxy Provider, starting at **$0.49/GB**
68
+
69
+ - 👩‍💻 **$0.49 per GB Residential Proxies**: Our price is unbeatable
70
+ - 👩‍💻 **24/7 Expert Support**: We will join your Slack Channel
71
+ - 🌍 **Global Presence**: Available in 150+ Countries
72
+ - ⚡ **Low Latency**
73
+ - 🔒 **Swiss Quality and Privacy**
74
+ - 🎁 **Free Trial**
75
+ - 🛡️ **99.9% Uptime**
76
+ - 🤝 **Special IP pool selection**: Optimize for speed, quality, or quantity of IPs
77
+ - 🔧 **Easy Integration**: Compatible with most software and programming languages
78
+
79
+ [![Evomi Banner](https://my.evomi.com/images/brand/cta.png)](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling)
80
+ ---
81
+
82
+ ## Table of Contents
83
+ * [Key Features](#key-features)
84
+ * [Fetch websites as you prefer](#fetch-websites-as-you-prefer)
85
+ * [Adaptive Scraping](#adaptive-scraping)
86
+ * [Performance](#performance)
87
+ * [Developing Experience](#developing-experience)
88
+ * [Getting Started](#getting-started)
89
+ * [Parsing Performance](#parsing-performance)
90
+ * [Text Extraction Speed Test (5000 nested elements).](#text-extraction-speed-test-5000-nested-elements)
91
+ * [Extraction By Text Speed Test](#extraction-by-text-speed-test)
92
+ * [Installation](#installation)
93
+ * [Fetching Websites Features](#fetching-websites-features)
94
+ * [Fetcher](#fetcher)
95
+ * [StealthyFetcher](#stealthyfetcher)
96
+ * [PlayWrightFetcher](#playwrightfetcher)
97
+ * [Advanced Parsing Features](#advanced-parsing-features)
98
+ * [Smart Navigation](#smart-navigation)
99
+ * [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
100
+ * [Handling Structural Changes](#handling-structural-changes)
101
+ * [Real World Scenario](#real-world-scenario)
102
+ * [Find elements by filters](#find-elements-by-filters)
103
+ * [Is That All?](#is-that-all)
104
+ * [More Advanced Usage](#more-advanced-usage)
105
+ * [⚡ Enlightening Questions and FAQs](#-enlightening-questions-and-faqs)
106
+ * [How does auto-matching work?](#how-does-auto-matching-work)
107
+ * [How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?](#how-does-the-auto-matching-work-if-i-didnt-pass-a-url-while-initializing-the-adaptor-object)
108
+ * [If all things about an element can change or get removed, what are the unique properties to be saved?](#if-all-things-about-an-element-can-change-or-get-removed-what-are-the-unique-properties-to-be-saved)
109
+ * [I have enabled the `auto_save`/`auto_match` parameter while selecting and it got completely ignored with a warning message](#i-have-enabled-the-auto_saveauto_match-parameter-while-selecting-and-it-got-completely-ignored-with-a-warning-message)
110
+ * [I have done everything as the docs but the auto-matching didn't return anything, what's wrong?](#i-have-done-everything-as-the-docs-but-the-auto-matching-didnt-return-anything-whats-wrong)
111
+ * [Can Scrapling replace code built on top of BeautifulSoup4?](#can-scrapling-replace-code-built-on-top-of-beautifulsoup4)
112
+ * [Can Scrapling replace code built on top of AutoScraper?](#can-scrapling-replace-code-built-on-top-of-autoscraper)
113
+ * [Is Scrapling thread-safe?](#is-scrapling-thread-safe)
114
+ * [More Sponsors!](#more-sponsors)
115
+ * [Contributing](#contributing)
116
+ * [Disclaimer for Scrapling Project](#disclaimer-for-scrapling-project)
117
+ * [License](#license)
118
+ * [Acknowledgments](#acknowledgments)
119
+ * [Thanks and References](#thanks-and-references)
120
+ * [Known Issues](#known-issues)
121
+
122
+ ## Key Features
123
+
124
+ ### Fetch websites as you prefer
125
+ - **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher`
126
+ - **Stealthy fetcher**: Annoying anti-bot protections? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` using its default configuration!
127
+ - **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - All is possible with `PlayWrightFetcher`!
128
+
129
+ ### Adaptive Scraping
130
+ - 🔄 **Smart Element Tracking**: Locate previously identified elements after website structure changes, using an intelligent similarity system and integrated storage.
131
+ - 🎯 **Flexible Querying**: Use CSS selectors, XPath, Elements filters, text search, or regex - chain them however you want!
132
+ - 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you want on the page (Ex: other products like the product you found on the page).
133
+ - 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors using Scrapling's powerful features.
134
+
135
+ ### Performance
136
+ - 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries (outperforming BeautifulSoup in parsing by up to 620x in our tests).
137
+ - 🔋 **Memory Efficient**: Optimized data structures for minimal memory footprint.
138
+ - ⚡ **Fast JSON serialization**: 10x faster JSON serialization than the standard json library with more options.
139
+
140
+ ### Developing Experience
141
+ - 🛠️ **Powerful Navigation API**: Traverse the DOM tree easily in all directions and get the info you want (parent, ancestors, siblings, children, next/previous element, and more).
142
+ - 🧬 **Rich Text Processing**: All strings have built-in methods for regex matching, cleaning, and more. All elements' attributes are read-only dictionaries that are faster than standard dictionaries with added methods.
143
+ - 📝 **Automatic Selector Generation**: Create robust CSS/XPath selectors for any element.
144
+ - 🔌 **API Similar to Scrapy/BeautifulSoup**: Familiar methods and similar pseudo-elements for Scrapy and BeautifulSoup users.
145
+ - 📘 **Type hints and test coverage**: Complete type coverage and almost full test coverage for better IDE support and fewer bugs, respectively.
146
+
147
+ ## Getting Started
148
+
149
+ ```python
150
+ from scrapling import Fetcher
151
+
152
+ fetcher = Fetcher(auto_match=False)
153
+
154
+ # Fetch a web page and create an Adaptor instance
155
+ page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
156
+ # Get all strings in the full page
157
+ page.get_all_text(ignore_tags=('script', 'style'))
158
+
159
+ # Get all quotes, any of these methods will return a list of strings (TextHandlers)
160
+ quotes = page.css('.quote .text::text') # CSS selector
161
+ quotes = page.xpath('//span[@class="text"]/text()') # XPath
162
+ quotes = page.css('.quote').css('.text::text') # Chained selectors
163
+ quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above
164
+
165
+ # Get the first quote element
166
+ quote = page.css_first('.quote') # / page.css('.quote').first / page.css('.quote')[0]
167
+
168
+ # Tired of selectors? Use find_all/find
169
+ quotes = page.find_all('div', {'class': 'quote'})
170
+ # Same as
171
+ quotes = page.find_all('div', class_='quote')
172
+ quotes = page.find_all(['div'], class_='quote')
173
+ quotes = page.find_all(class_='quote') # and so on...
174
+
175
+ # Working with elements
176
+ quote.html_content # Inner HTML
177
+ quote.prettify() # Prettified version of Inner HTML
178
+ quote.attrib # Element attributes
179
+ quote.path # DOM path to element (List)
180
+ ```
181
+ To keep it simple, all methods can be chained on top of each other!
182
+
183
+ ## Parsing Performance
184
+
185
+ Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents.
186
+ Here are benchmarks comparing Scrapling to popular Python libraries in two tests.
187
+
188
+ ### Text Extraction Speed Test (5000 nested elements).
189
+
190
+ | # | Library | Time (ms) | vs Scrapling |
191
+ |---|:-----------------:|:---------:|:------------:|
192
+ | 1 | Scrapling | 5.44 | 1.0x |
193
+ | 2 | Parsel/Scrapy | 5.53 | 1.017x |
194
+ | 3 | Raw Lxml | 6.76 | 1.243x |
195
+ | 4 | PyQuery | 21.96 | 4.037x |
196
+ | 5 | Selectolax | 67.12 | 12.338x |
197
+ | 6 | BS4 with Lxml | 1307.03 | 240.263x |
198
+ | 7 | MechanicalSoup | 1322.64 | 243.132x |
199
+ | 8 | BS4 with html5lib | 3373.75 | 620.175x |
200
+
201
+ As you can see, Scrapling is on par with Parsel/Scrapy and slightly faster than raw lxml, the library both of them are built on top of; these are the closest results to Scrapling. PyQuery is also built on top of lxml, yet Scrapling is still about 4 times faster.
202
+
203
+ ### Extraction By Text Speed Test
204
+
205
+ | Library | Time (ms) | vs Scrapling |
206
+ |:-----------:|:---------:|:------------:|
207
+ | Scrapling | 2.51 | 1.0x |
208
+ | AutoScraper | 11.41 | 4.546x |
209
+
210
+ Scrapling can find elements in more ways and it returns full `Adaptor` objects, not just the text like AutoScraper. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them. As you can see, Scrapling is still about 4.5 times faster at the same task.
211
+
212
+ > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your own comparisons.
213
+
214
+ ## Installation
215
+ Scrapling is a breeze to get started with. Starting from version 0.2, it requires at least Python 3.8 to work.
216
+ ```bash
217
+ pip3 install scrapling
218
+ ```
219
+ - For using the `StealthyFetcher`, go to the command line and download the browser with
220
+ <details><summary>Windows OS</summary>
221
+
222
+ ```bash
223
+ camoufox fetch --browserforge
224
+ ```
225
+ </details>
226
+ <details><summary>MacOS</summary>
227
+
228
+ ```bash
229
+ python3 -m camoufox fetch --browserforge
230
+ ```
231
+ </details>
232
+ <details><summary>Linux</summary>
233
+
234
+ ```bash
235
+ python -m camoufox fetch --browserforge
236
+ ```
237
+ On a fresh installation of Linux, you may also need the following Firefox dependencies:
238
+ - Debian-based distros
239
+ ```bash
240
+ sudo apt install -y libgtk-3-0 libx11-xcb1 libasound2
241
+ ```
242
+ - Arch-based distros
243
+ ```bash
244
+ sudo pacman -S gtk3 libx11 libxcb cairo libasound alsa-lib
245
+ ```
246
+ </details>
247
+
248
+ <small> See the official <a href="https://camoufox.com/python/installation/#download-the-browser">Camoufox documentation</a> for more info on installation</small>
249
+
250
+ - If you are going to use the `PlayWrightFetcher` options, then install Playwright's Chromium browser with:
251
+ ```commandline
252
+ playwright install chromium
253
+ ```
254
+ - If you are going to use normal requests only with the `Fetcher` class then update the fingerprints files with:
255
+ ```commandline
256
+ python -m browserforge update
257
+ ```
258
+
259
+ ## Fetching Websites Features
260
+ All fetcher-type classes are imported in the same way
261
+ ```python
262
+ from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
263
+ ```
264
+ And all of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug`, which are the same ones you give to the `Adaptor` class.
265
+
266
+ Also, the `Response` object returned from all fetchers is the same as the `Adaptor` object except that it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All `cookies`, `headers`, and `request_headers` are always of type `dictionary`.
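+ For instance, here's a minimal sketch of what that looks like (the status line and header value shown are illustrative):
+ ```python
+ >> from scrapling import Fetcher
+ >> fetcher = Fetcher(auto_match=True)  # `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug` work here too
+ >> page = fetcher.get('https://quotes.toscrape.com/')
+ >> page.status, page.reason  # Extra attributes on top of the usual `Adaptor` API
+ (200, 'OK')
+ >> page.headers['content-type']  # `cookies`, `headers`, and `request_headers` are plain dictionaries
+ 'text/html; charset=utf-8'
+ ```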
267
+ > [!NOTE]
268
+ > The `auto_match` argument is enabled by default, and it's the one you should care about the most, as you will see later.
269
+ ### Fetcher
270
+ This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options. Here you can make `GET`, `POST`, `PUT`, and `DELETE` requests.
271
+
272
+ For all methods, you have `stealthy_headers`, which makes `Fetcher` create and use real browsers' headers and then create a referer header as if the request came from a Google search for this URL's domain. It's enabled by default.
273
+ ```python
274
+ >> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
275
+ >> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
276
+ >> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
277
+ >> page = Fetcher().delete('https://httpbin.org/delete')
278
+ ```
279
+ ### StealthyFetcher
280
+ This class is built on top of [Camoufox](https://github.com/daijro/camoufox) which by default bypasses most of the anti-bot protections. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
281
+ ```python
282
+ >> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
283
+ >> page.status == 200
284
+ True
285
+ ```
286
+ > Note: all requests done by this fetcher wait by default for all JS to be fully loaded and executed, so you don't have to :)
287
+
288
+ <details><summary><strong>For the sake of simplicity, expand this for the complete list of arguments</strong></summary>
289
+
290
+ | Argument | Description | Optional |
291
+ |:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
292
+ | url | Target url | ❌ |
293
+ | headless | Pass `True` to run the browser in headless/hidden (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | ✔️ |
294
+ | block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
295
+ | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
296
+ | google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
297
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
298
+ | block_webrtc | Blocks WebRTC entirely. | ✔️ |
299
+ | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
300
+ | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
301
+ | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
302
+ | allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
303
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
304
+ | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
305
+ | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
306
+ | proxy | The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
307
+ | os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS. | ✔️ |
308
+ | wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
309
+
310
+ </details>
311
+
312
+ This list isn't final, so expect many more additions and more flexibility in the next versions!
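+ For illustration, here's a hedged sketch combining a few of these arguments; the `accept_cookies` function and the selectors in it are hypothetical, and the exact `page` API available inside `page_action` depends on the underlying browser automation:
+ ```python
+ def accept_cookies(page):
+     # Hypothetical automation step; `page_action` must return the page object
+     page.click('#accept')
+     return page
+ 
+ page = StealthyFetcher().fetch(
+     'https://example.com',
+     headless=True,              # default mode
+     block_images=True,          # saves bandwidth, but some sites may never finish loading
+     network_idle=True,          # wait until there are no network connections for 500 ms
+     wait_selector='.product',   # hypothetical CSS selector to wait for
+     page_action=accept_cookies,
+ )
+ ```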
313
+
314
+ ### PlayWrightFetcher
315
+ This class is built on top of [Playwright](https://playwright.dev/python/) which currently provides 4 main run options, and they can be mixed as you want.
316
+ ```python
317
+ >> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
318
+ >> page.css_first("#search a::attr(href)")
319
+ 'https://github.com/D4Vinci/Scrapling'
320
+ ```
321
+ > Note: all requests done by this fetcher wait by default for all JS to be fully loaded and executed, so you don't have to :)
322
+
323
+ Using this Fetcher class, you can make requests with:
324
+ 1) Vanilla Playwright without any modifications other than the ones you chose.
325
+ 2) Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/).</br> Some of the things this fetcher's stealth mode does include:
326
+ * Patching the CDP runtime fingerprint.
327
+ * Mimicking some of the real browsers' properties by injecting several JS files and using custom options.
328
+ * Using custom flags on launch to hide Playwright even more and make it faster.
329
+ * Generating real browser headers of the same browser type and the user's OS, then appending them to the request's headers.
330
+ 3) Real browsers, by passing your browser's CDP URL for the fetcher to control; most of the options can be enabled with it.
331
+ 4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
332
+
333
+ Add that to a lot of controlling/hiding options as you will see in the arguments list below.
334
+
335
+ <details><summary><strong>Expand this for the complete list of arguments</strong></summary>
336
+
337
+ | Argument | Description | Optional |
338
+ |:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
339
+ | url | Target url | ❌ |
340
+ | headless | Pass `True` to run the browser in headless/hidden (**default**), or `False` for headful/visible mode. | ✔️ |
341
+ | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
342
+ | useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | ✔️ |
343
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
344
+ | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
345
+ | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
346
+ | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
347
+ | wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
348
+ | google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
349
+ | extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
350
+ | proxy | The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
351
+ | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
352
+ | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
353
+ | stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
354
+ | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
355
+ | nstbrowser_mode | Enables NSTBrowser mode, **it has to be used with the `cdp_url` argument or it will be completely ignored.** | ✔️ |
356
+ | nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
357
+
358
+ </details>
359
+
360
+ This list isn't final, so expect many more additions and more flexibility in the next versions!
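+ As a quick, hedged sketch of the other run options (the URLs below are placeholders, not real endpoints):
+ ```python
+ # Stealthy Playwright with the built-in stealth mode
+ page = PlayWrightFetcher().fetch('https://example.com', stealth=True)
+ 
+ # Control a real browser (or NSTBrowser's browserless) through its CDP URL
+ page = PlayWrightFetcher().fetch('https://example.com', cdp_url='ws://localhost:9222/devtools/browser/placeholder')
+ page = PlayWrightFetcher().fetch('https://example.com', cdp_url='ws://localhost:9222/devtools/browser/placeholder', nstbrowser_mode=True)
+ ```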
361
+
362
+ ## Advanced Parsing Features
363
+ ### Smart Navigation
364
+ ```python
365
+ >>> quote.tag
366
+ 'div'
367
+
368
+ >>> quote.parent
369
+ <data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>
370
+
371
+ >>> quote.parent.tag
372
+ 'div'
373
+
374
+ >>> quote.children
375
+ [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
376
+ <data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>,
377
+ <data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]
378
+
379
+ >>> quote.siblings
380
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
381
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
382
+ ...]
383
+
384
+ >>> quote.next # gets the next element, the same logic applies to `quote.previous`
385
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>
386
+
387
+ >>> quote.children.css_first(".author::text")
388
+ 'Albert Einstein'
389
+
390
+ >>> quote.has_class('quote')
391
+ True
392
+
393
+ # Generate new selectors for any element
394
+ >>> quote.generate_css_selector
395
+ 'body > div > div:nth-of-type(2) > div > div'
396
+
397
+ # Test these selectors on your favorite browser or reuse them again in the library's methods!
398
+ >>> quote.generate_xpath_selector
399
+ '//body/div/div[2]/div/div'
400
+ ```
401
+ If your case needs more than the element's direct parent, you can iterate over the whole ancestor tree of any element like below
402
+ ```python
403
+ for ancestor in quote.iterancestors():
404
+     print(ancestor.tag)  # do something with it...
405
+ ```
406
+ You can also search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes an `Adaptor` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
407
+ ```python
408
+ >>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
409
+ <data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>
410
+ ```
411
+
412
+ ### Content-based Selection & Finding Similar Elements
413
+ You can select elements by their text content in multiple ways. Here's a full example on another website:
414
+ ```python
415
+ >>> page = Fetcher().get('https://books.toscrape.com/index.html')
416
+
417
+ >>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
418
+ <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
419
+
420
+ >>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
421
+ [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
422
+
423
+ >>> page.find_by_regex(r'£[\d\.]+') # Get the first element whose text content matches my price regex
424
+ <data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>
425
+
426
+ >>> page.find_by_regex(r'£[\d\.]+', first_match=False) # Get all elements that match my price regex
427
+ [<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
428
+ <data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
429
+ <data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
430
+ <data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
431
+ ...]
432
+ ```
433
+ Find all elements that are similar to the current element in location and attributes
434
+ ```python
435
+ # For this case, ignore the 'title' attribute while matching
436
+ >>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
437
+ [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
438
+ <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
439
+ <data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
440
+ ...]
441
+
442
+ # You will notice that the number of elements is 19 not 20 because the current element is not included.
443
+ >>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
444
+ 19
445
+
446
+ # Get the `href` attribute from all similar elements
447
+ >>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
448
+ ['catalogue/a-light-in-the-attic_1000/index.html',
449
+ 'catalogue/soumission_998/index.html',
450
+ 'catalogue/sharp-objects_997/index.html',
451
+ ...]
452
+ ```
453
+ To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point for some reason
454
+ ```python
455
+ >>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
456
+ print({
457
+ "name": product.css_first('h3 a::text'),
458
+ "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
459
+ "stock": product.css('.availability::text')[-1].clean()
460
+ })
461
+ {'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
462
+ {'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
463
+ {'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
464
+ ...
465
+ ```
466
+ The [documentation](https://github.com/D4Vinci/Scrapling/tree/main/docs/Examples) will provide more advanced examples.
467
+
468
+ ### Handling Structural Changes
469
+ Let's say you are scraping a page with a structure like this:
470
+ ```html
471
+ <div class="container">
472
+ <section class="products">
473
+ <article class="product" id="p1">
474
+ <h3>Product 1</h3>
475
+ <p class="description">Description 1</p>
476
+ </article>
477
+ <article class="product" id="p2">
478
+ <h3>Product 2</h3>
479
+ <p class="description">Description 2</p>
480
+ </article>
481
+ </section>
482
+ </div>
483
+ ```
484
+ And you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this
485
+ ```python
486
+ page.css('#p1')
487
+ ```
488
+ When website owners implement structural changes like
489
+ ```html
490
+ <div class="new-container">
491
+ <div class="product-wrapper">
492
+ <section class="products">
493
+ <article class="product new-class" data-id="p1">
494
+ <div class="product-info">
495
+ <h3>Product 1</h3>
496
+ <p class="new-description">Description 1</p>
497
+ </div>
498
+ </article>
499
+ <article class="product new-class" data-id="p2">
500
+ <div class="product-info">
501
+ <h3>Product 2</h3>
502
+ <p class="new-description">Description 2</p>
503
+ </div>
504
+ </article>
505
+ </section>
506
+ </div>
507
+ </div>
508
+ ```
509
+ The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
510
+
511
+ ```python
512
+ from scrapling import Adaptor
513
+ # Before the change
514
+ page = Adaptor(page_source, url='example.com')
515
+ element = page.css('#p1', auto_save=True)
516
+ if not element: # One day website changes?
517
+ element = page.css('#p1', auto_match=True) # Scrapling still finds it!
518
+ # the rest of the code...
519
+ ```
520
+ > How does the auto-matching work? Check the [FAQs](#-enlightening-questions-and-faqs) section for that and other possible issues while auto-matching.
521
+
522
+ #### Real-World Scenario
523
+ Let's use a real website as an example and use one of the fetchers to fetch its source. To do this we need to find a website that will change its design/structure soon, take a copy of its source, then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner, but that would make it a staged test haha.
524
+
525
+ To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverFlow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/), pretty old huh?</br>Let's test if the automatch feature can extract the same button in the old design from 2010 and the current design using the same selector :)
526
+
527
+ If I want to extract the Questions button from the old design, I can use a selector like this: `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is too specific because it was generated by Google Chrome.
528
+ Now let's test the same selector in both versions
529
+ ```python
530
+ >> from scrapling import Fetcher
531
+ >> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
532
+ >> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
533
+ >> new_url = "https://stackoverflow.com/"
534
+ >>
535
+ >> page = Fetcher(automatch_domain='stackoverflow.com').get(old_url, timeout=30)
536
+ >> element1 = page.css_first(selector, auto_save=True)
537
+ >>
538
+ >> # Same selector but used in the updated website
539
+ >> page = Fetcher(automatch_domain="stackoverflow.com").get(new_url)
540
+ >> element2 = page.css_first(selector, auto_match=True)
541
+ >>
542
+ >> if element1.text == element2.text:
543
+ ... print('Scrapling found the same element in the old design and the new design!')
544
+ 'Scrapling found the same element in the old design and the new design!'
545
+ ```
546
+ Note that I used a new argument called `automatch_domain`. This is because, for Scrapling, these are two different URLs and hence two different websites, so it isolates their data. To tell Scrapling they are the same website, we pass the domain we want to use for saving auto-match data for both of them, so Scrapling doesn't isolate them.
547
+
548
+ In a real-world scenario, the code will be the same except it will use the same URL for both requests so you won't need to use the `automatch_domain` argument. This is the closest example I can give to real-world cases so I hope it didn't confuse you :)
549
+
550
+ **Notes:**
551
+ 1. For the two examples above, I used the `Adaptor` class the first time and the `Fetcher` class the second time, just to show that you can create the `Adaptor` object yourself if you have the source, or fetch the source using any fetcher class and it will create the `Adaptor` object for you.
552
+ 2. Passing the `auto_save` argument while the `auto_match` argument is set to `False` on Adaptor/Fetcher initialization will only result in the `auto_save` argument value being ignored, with the following warning message:
553
+ ```text
554
+ Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.
555
+ ```
556
+ This behavior is purely for performance reasons so the database gets created/connected only when you are planning to use the auto-matching features. Same case with the `auto_match` argument.
557
+
558
+ 3. The `auto_match` parameter works only for `Adaptor` instances, not `Adaptors`, so if you do something like this you will get an error
559
+ ```python
560
+ page.css('body').css('#p1', auto_match=True)
561
+ ```
562
+ because you can't auto-match a whole list, you have to be specific and do something like
563
+ ```python
564
+ page.css_first('body').css('#p1', auto_match=True)
565
+ ```
566
+
567
+ ### Find elements by filters
568
+ Inspired by BeautifulSoup's `find_all` function, you can find elements by using the `find_all`/`find` methods. Both methods can take multiple types of filters and return all elements on the page that all these filters apply to.
569
+
570
+ * To be more specific:
571
+ * Any string passed is considered a tag name
572
+ * Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
573
+ * Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
574
+ * Any regex patterns passed are used as filters
575
+ * Any functions passed are used as filters
576
+ * Any keyword argument passed is considered an HTML element attribute with its value.
577
+
578
+ So the way it works is after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
579
+ <br/>It filters all elements in the current page/element in the following order:
580
+
581
+ 1. All elements with the passed tag name(s).
582
+ 2. All elements that match all passed attribute(s).
583
+ 3. All elements that match all passed regex patterns.
584
+ 4. All elements that fulfill all passed function(s).
585
+
586
+ Note: The filtering process always starts from the first filter it finds in the filtering order above so if no tag name(s) are passed but attributes are passed, the process starts from that layer and so on. **But the order in which you pass the arguments doesn't matter.**
587
+
588
+ Examples to clear any confusion :)
589
+
590
+ ```python
591
+ >> from scrapling import Fetcher
592
+ >> page = Fetcher().get('https://quotes.toscrape.com/')
593
+ # Find all elements with tag name `div`.
594
+ >> page.find_all('div')
595
+ [<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
596
+ <data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
597
+ ...]
598
+
599
+ # Find all div elements with a class that equals `quote`.
600
+ >> page.find_all('div', class_='quote')
601
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
602
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
603
+ ...]
604
+
605
+ # Same as above.
606
+ >> page.find_all('div', {'class': 'quote'})
607
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
608
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
609
+ ...]
610
+
611
+ # Find all elements with a class that equals `quote`.
612
+ >> page.find_all({'class': 'quote'})
613
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
614
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
615
+ ...]
616
+
617
+ # Find all div elements with a class that equals `quote`, and contains the element `.text` which contains the word 'world' in its content.
618
+ >> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
619
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
620
+
621
+ # Find all elements that have at least one child element.
622
+ >> page.find_all(lambda element: len(element.children) > 0)
623
+ [<data='<html lang="en"><head><meta charset="UTF...'>,
624
+ <data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
625
+ <data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
626
+ ...]
627
+
628
+ # Find all elements that contain the word 'world' in their content.
629
+ >> page.find_all(lambda element: "world" in element.text)
630
+ [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
631
+ <data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]
632
+
633
+ # Find all span elements that match the given regex
634
+ >> page.find_all('span', re.compile(r'world'))
635
+ [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]
636
+
637
+ # Find all div and span elements with class 'quote' (No span elements like that so only div returned)
638
+ >> page.find_all(['div', 'span'], {'class': 'quote'})
639
+ [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
640
+ <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
641
+ ...]
642
+
643
+ # Mix things up
644
+ >> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
645
+ ['Albert Einstein',
646
+ 'J.K. Rowling',
647
+ ...]
648
+ ```
649
+
650
+ ### Is That All?
651
+ Here's what else you can do with Scrapling:
652
+
653
+ - Accessing the `lxml.etree` object itself of any element directly
654
+ ```python
655
+ >>> quote._root
656
+ <Element div at 0x107f98870>
657
+ ```
658
+ - Saving and retrieving elements manually to auto-match them outside the `css` and the `xpath` methods, but you have to set the identifier yourself.
659
+
660
+ - To save an element to the database:
661
+ ```python
662
+ >>> element = page.find_by_text('Tipping the Velvet', first_match=True)
663
+ >>> page.save(element, 'my_special_element')
664
+ ```
665
+ - Now later when you want to retrieve it and relocate it inside the page with auto-matching, it would be like this
666
+ ```python
667
+ >>> element_dict = page.retrieve('my_special_element')
668
+ >>> page.relocate(element_dict, adaptor_type=True)
669
+ [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
670
+ >>> page.relocate(element_dict, adaptor_type=True).css('::text')
671
+ ['Tipping the Velvet']
672
+ ```
673
+ - If you want to keep it as an `lxml.etree` object, leave out the `adaptor_type` argument
674
+ ```python
675
+ >>> page.relocate(element_dict)
676
+ [<Element a at 0x105a2a7b0>]
677
+ ```
678
+
679
+ - Filtering results based on a function
680
+ ```python
681
+ # Find all products over $50
682
+ expensive_products = page.css('.product_pod').filter(
683
+ lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
684
+ )
685
+ ```
686
+
687
+ - Searching results for the first one that matches a function
688
+ ```python
689
+ # Find the first product with price '54.23'
690
+ page.css('.product_pod').search(
691
+ lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
692
+ )
693
+ ```
694
+
695
+ - Doing operations on element content is the same as in Scrapy
696
+ ```python
697
+ quote.re(r'regex_pattern') # Get all strings (TextHandlers) that match the regex pattern
698
+ quote.re_first(r'regex_pattern') # Get the first string (TextHandler) only
699
+ quote.json() # If the content text is jsonable, then convert it to json using `orjson` which is 10x faster than the standard json library and provides more options
700
+ ```
701
+ except that you can do more with them like
702
+ ```python
703
+ quote.re(
704
+ r'regex_pattern',
705
+ replace_entities=True, # Character entity references are replaced by their corresponding character
706
+ clean_match=True, # This will ignore all whitespaces and consecutive spaces while matching
707
+ case_sensitive=False,  # Set the regex to ignore letter case while compiling it
708
+ )
709
+ ```
710
+ All of these methods come from the `TextHandler` class that holds the text content, so the same can be done directly on the string you get from the `.text` property or from an equivalent selector function.
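+ As a quick sketch (reusing the quotes page from the Getting Started example):
+ ```python
+ # `::text` selectors return TextHandler strings, so the helpers above work on them directly
+ page.css_first('.quote .text::text').clean()
+ page.css_first('.quote .text::text').re_first(r'world')
+ ```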
711
+
712
+
713
+ - Doing operations on the text content itself includes
714
+ - Cleaning the text of stray whitespace and replacing consecutive spaces with a single space
715
+ ```python
716
+ quote.clean()
717
+ ```
718
+ - You already know about regex matching and fast JSON parsing, but did you know that all strings returned from a regex search are `TextHandler` objects too? So when, for example, a JS object is assigned to a JS variable inside JS code and you want to extract it with regex and then convert it to a JSON object, other libraries would need more than one line of code, but here you can do it in one line like this
719
+ ```python
720
+ page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
721
+ ```
722
+ - Sort all characters in the string as if it were a list and return the new string
723
+ ```python
724
+ quote.sort(reverse=False)
725
+ ```
726
+ > To be clear, `TextHandler` is a sub-class of Python's `str` so all normal operations/methods that work with Python strings will work with it.
727
+
728
+ - Any element's attributes are not exactly a dictionary but a read-only sub-class of [mapping](https://docs.python.org/3/glossary.html#term-mapping) called `AttributesHandler`, so it's faster, and the string values returned are `TextHandler` objects, so all the operations above can be done on them, plus standard dictionary operations that don't modify the data, and more :)
729
+ - Unlike standard dictionaries, here you can search by values too and can do partial searches. It might be handy in some cases (returns a generator of matches)
730
+ ```python
731
+ >>> for item in element.attrib.search_values('catalogue', partial=True):
732
+ print(item)
733
+ {'href': 'catalogue/tipping-the-velvet_999/index.html'}
734
+ ```
735
+ - Serialize the current attributes to JSON bytes:
736
+ ```python
737
+ >>> element.attrib.json_string
738
+ b'{"href":"catalogue/tipping-the-velvet_999/index.html","title":"Tipping the Velvet"}'
739
+ ```
740
+ - Converting it to a normal dictionary
741
+ ```python
742
+ >>> dict(element.attrib)
743
+ {'href': 'catalogue/tipping-the-velvet_999/index.html',
744
+ 'title': 'Tipping the Velvet'}
745
+ ```
746
+
747
+ Scrapling is under active development so expect many more features coming soon :)
748
+
749
+ ## More Advanced Usage
750
+
751
+ A lot of deep details are skipped here to keep this as short as possible, so to take a deep dive, head to the [docs](https://github.com/D4Vinci/Scrapling/tree/main/docs) section. I will try to keep it as updated as possible and add complex examples. There I will explain things like how to write your own storage system, how to write spiders that don't depend on selectors at all, and more...
752
+
753
+ Note that implementing your storage system can be complex as there are some strict rules such as inheriting from the same abstract class, following the singleton design pattern used in other classes, and more. So make sure to read the docs first.
754
+
755
+ > [!IMPORTANT]
756
+ > A website is needed to provide detailed library documentation.<br/>
757
+ > I'm trying to rush creating the website, researching new ideas, and adding more features/tests/benchmarks but time is tight with too many spinning plates between work, personal life, and working on Scrapling. I have been working on Scrapling for months for free after all.<br/><br/>
758
+ > If you like `Scrapling` and want it to keep improving then this is a friendly reminder that you can help by supporting me through the [sponsor button](https://github.com/sponsors/D4Vinci).
759
+
760
+ ## ⚡ Enlightening Questions and FAQs
761
+ This section addresses common questions about Scrapling, please read this section before opening an issue.
762
+
763
+ ### How does auto-matching work?
764
+ 1. You need to get a working selector and run it at least once with methods `css` or `xpath` with the `auto_save` parameter set to `True` before structural changes happen.
765
+ 2. Before returning results for you, Scrapling uses its configured database and saves unique properties about that element.
766
+ 3. Now because everything about the element can be changed or removed, nothing from the element can be used as a unique identifier for the database. To solve this issue, I made the storage system rely on two things:
767
+ 1. The domain of the URL you gave while initializing the first Adaptor object
768
+ 2. The `identifier` parameter you passed to the method while selecting. If you didn't pass one, then the selector string itself will be used as an identifier but remember you will have to use it as an identifier value later when the structure changes and you want to pass the new selector.
769
+
770
+ Together both are used to retrieve the element's unique properties from the database later.
771
+ 4. Later, when you enable the `auto_match` parameter for both the Adaptor instance and the method call, the element's properties are retrieved, and Scrapling loops over all elements on the page, comparing each one's unique properties to the unique properties we already have for this element; a score is calculated for each one.
772
+ 5. Comparing elements is not exact but more about finding how similar these values are, so everything is taken into consideration, even the values' order, like the order in which the element class names were written before and the order in which the same element class names are written now.
773
+ 6. The score for each element is stored in the table, and the element(s) with the highest combined similarity scores are returned.
774
+
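+ Here's a minimal sketch of that workflow; the URL, selector, and identifier are hypothetical examples:
+ ```python
+ from scrapling import Fetcher
+ 
+ # Steps 1-3: run the working selector once with `auto_save` so the element's unique properties get stored
+ page = Fetcher(auto_match=True).get('https://example.com')
+ button = page.css_first('#old-button', auto_save=True, identifier='main-button')
+ 
+ # Steps 4-6 (later, after the structure changed): relocate it by similarity with `auto_match`
+ page = Fetcher(auto_match=True).get('https://example.com')
+ button = page.css_first('#old-button', auto_match=True, identifier='main-button')
+ ```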
775
+ ### How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?
776
+ Not a big problem, as it depends on your usage. The word `default` will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website that you also didn't pass a URL for while initializing it. In that case, the save process overwrites the previous data, and auto-matching uses only the latest saved properties.
777
+
778
+ ### If all things about an element can change or get removed, what are the unique properties to be saved?
779
+ For each element, Scrapling will extract:
780
+ - Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
781
+ - Element's parent tag name, attributes (names and values), and text.
782
+
783
+ ### I have enabled the `auto_save`/`auto_match` parameter while selecting and it got completely ignored with a warning message
784
+ That's because passing the `auto_save`/`auto_match` argument without setting `auto_match` to `True` while initializing the Adaptor object will only result in ignoring the `auto_save`/`auto_match` argument value. This behavior is purely for performance reasons so the database gets created only when you are planning to use the auto-matching features.
785
+
786
+ ### I have done everything as the docs but the auto-matching didn't return anything, what's wrong?
787
+ It could be one of these reasons:
788
+ 1. No data were saved/stored for this element before.
789
+ 2. The selector passed is not the one used while storing element data. The solution is simple:
790
+ - Pass the old selector again as an identifier to the method called.
791
+ - Retrieve the element with the retrieve method using the old selector as identifier then save it again with the save method and the new selector as identifier.
792
+ - Start using the identifier argument more often if you are planning to use every new selector from now on.
793
+ 3. The website had some extreme structural changes like a new full design. If this happens a lot with this website, the solution would be to make your code as selector-free as possible using Scrapling features.
794
+
795
+ ### Can Scrapling replace code built on top of BeautifulSoup4?
796
+ Pretty much yeah, almost all features you get from BeautifulSoup can be found or achieved in Scrapling one way or another. In fact, if you see there's a feature in bs4 that is missing in Scrapling, please make a feature request from the issues tab to let me know.
797
+
798
+ ### Can Scrapling replace code built on top of AutoScraper?
799
+ Of course, you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later as the model feature in AutoScraper. I have pulled all top articles about AutoScraper from Google and tested Scrapling against examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.
800
+
801
+ ### Is Scrapling thread-safe?
802
+ Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.
803
+
804
+ ## More Sponsors!
805
+ [![Capsolver Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/CapSolver.png)](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
806
+
807
+ ## Contributing
808
+ Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
809
+
810
+ Please read the [contributing file](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before doing anything.
811
+
812
+ ## Disclaimer for Scrapling Project
813
+ > [!CAUTION]
814
+ > This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or within their allowed rules like the `robots.txt` file, for example.
815
+
816
+ ## License
817
+ This work is licensed under the BSD-3-Clause license.
818
+
819
+ ## Acknowledgments
820
+ This project includes code adapted from:
821
+ - Parsel (BSD License) - Used for [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/translator.py) submodule
822
+
823
+ ## Thanks and References
824
+ - [Daijro](https://github.com/daijro)'s brilliant work on both [BrowserForge](https://github.com/daijro/browserforge) and [Camoufox](https://github.com/daijro/camoufox)
825
+ - [Vinyzu](https://github.com/Vinyzu)'s work on Playwright's mock on [Botright](https://github.com/Vinyzu/Botright)
826
+ - [brotector](https://github.com/kaliiiiiiiiii/brotector)
827
+ - [fakebrowser](https://github.com/kkoooqq/fakebrowser)
828
+ - [rebrowser-patches](https://github.com/rebrowser/rebrowser-patches)
829
+
830
+ ## Known Issues
831
+ - In the auto-matching save process, only the unique properties of the first element from the selection results get saved. So if the selector you are using matches different elements on the page in different locations, auto-matching will probably return only that first element when you relocate it later. This doesn't include combined CSS selectors (using commas to combine more than one selector, for example), as these selectors get separated and each one is executed alone.
832
+ - Currently, Scrapling is not compatible with async/await.
833
+
834
+ ---
835
+ <div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>