scrapling 0.2.8__tar.gz → 0.2.91__tar.gz
- {scrapling-0.2.8/scrapling.egg-info → scrapling-0.2.91}/PKG-INFO +33 -18
- {scrapling-0.2.8 → scrapling-0.2.91}/README.md +27 -10
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/__init__.py +4 -4
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/_types.py +2 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/custom_types.py +88 -6
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/storage_adaptors.py +5 -6
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/translator.py +2 -2
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/utils.py +29 -27
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/defaults.py +2 -1
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/camo.py +124 -24
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/constants.py +4 -4
- scrapling-0.2.91/scrapling/engines/pw.py +363 -0
- scrapling-0.2.91/scrapling/engines/static.py +172 -0
- scrapling-0.2.91/scrapling/engines/toolbelt/__init__.py +6 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/custom.py +16 -22
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/fingerprints.py +3 -3
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/navigation.py +21 -8
- scrapling-0.2.91/scrapling/fetchers.py +432 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/parser.py +50 -22
- {scrapling-0.2.8 → scrapling-0.2.91/scrapling.egg-info}/PKG-INFO +33 -18
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling.egg-info/SOURCES.txt +8 -3
- scrapling-0.2.91/scrapling.egg-info/requires.txt +10 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/setup.cfg +1 -1
- {scrapling-0.2.8 → scrapling-0.2.91}/setup.py +6 -8
- scrapling-0.2.91/tests/fetchers/async/test_camoufox.py +95 -0
- scrapling-0.2.91/tests/fetchers/async/test_httpx.py +83 -0
- scrapling-0.2.91/tests/fetchers/async/test_playwright.py +99 -0
- scrapling-0.2.91/tests/fetchers/sync/__init__.py +0 -0
- scrapling-0.2.91/tests/fetchers/sync/test_camoufox.py +68 -0
- scrapling-0.2.91/tests/fetchers/sync/test_httpx.py +82 -0
- scrapling-0.2.91/tests/fetchers/sync/test_playwright.py +87 -0
- scrapling-0.2.91/tests/fetchers/test_utils.py +97 -0
- scrapling-0.2.91/tests/parser/__init__.py +0 -0
- scrapling-0.2.91/tests/parser/test_automatch.py +111 -0
- scrapling-0.2.91/tests/parser/test_general.py +330 -0
- scrapling-0.2.8/scrapling/engines/pw.py +0 -259
- scrapling-0.2.8/scrapling/engines/static.py +0 -129
- scrapling-0.2.8/scrapling/engines/toolbelt/__init__.py +0 -6
- scrapling-0.2.8/scrapling/fetchers.py +0 -217
- scrapling-0.2.8/scrapling.egg-info/requires.txt +0 -11
- scrapling-0.2.8/tests/fetchers/test_camoufox.py +0 -65
- scrapling-0.2.8/tests/fetchers/test_httpx.py +0 -68
- scrapling-0.2.8/tests/fetchers/test_playwright.py +0 -77
- scrapling-0.2.8/tests/fetchers/test_utils.py +0 -129
- scrapling-0.2.8/tests/parser/test_automatch.py +0 -56
- scrapling-0.2.8/tests/parser/test_general.py +0 -288
- {scrapling-0.2.8 → scrapling-0.2.91}/LICENSE +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/MANIFEST.in +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/__init__.py +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/mixins.py +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/__init__.py +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/bypasses/pdf_viewer.js +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling/py.typed +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/scrapling.egg-info/top_level.txt +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/tests/__init__.py +0 -0
- {scrapling-0.2.8 → scrapling-0.2.91}/tests/fetchers/__init__.py +0 -0
- {scrapling-0.2.8/tests/parser → scrapling-0.2.91/tests/fetchers/async}/__init__.py +0 -0

{scrapling-0.2.8/scrapling.egg-info → scrapling-0.2.91}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.2.8
+Version: 0.2.91
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
@@ -21,7 +21,6 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
@@ -29,7 +28,7 @@ Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
 Classifier: Programming Language :: Python :: Implementation :: CPython
 Classifier: Typing :: Typed
-Requires-Python: >=3.8
+Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: requests>=2.3
@@ -38,11 +37,10 @@ Requires-Dist: cssselect>=1.2
 Requires-Dist: w3lib
 Requires-Dist: orjson>=3
 Requires-Dist: tldextract
-Requires-Dist: httpx[brotli,zstd]
-Requires-Dist: playwright
-Requires-Dist: rebrowser-playwright
-Requires-Dist: camoufox>=0.4.
-Requires-Dist: browserforge
+Requires-Dist: httpx[brotli,socks,zstd]
+Requires-Dist: playwright>=1.49.1
+Requires-Dist: rebrowser-playwright>=1.49.1
+Requires-Dist: camoufox[geoip]>=0.4.9
 
 # 🕷️ Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
 [](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)
@@ -52,7 +50,7 @@ Dealing with failing web scrapers due to anti-bot protections or website changes
 Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
 
 ```python
->> from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
+>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
 # Fetch websites' source under the radar!
 >> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
 >> print(page.status)
@@ -81,7 +79,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 
 ## Table of content
 * [Key Features](#key-features)
-  * [Fetch websites as you prefer](#fetch-websites-as-you-prefer)
+  * [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support)
   * [Adaptive Scraping](#adaptive-scraping)
   * [Performance](#performance)
   * [Developing Experience](#developing-experience)
@@ -122,7 +120,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 
 ## Key Features
 
-### Fetch websites as you prefer
+### Fetch websites as you prefer with async support
 - **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher`
 - **Stealthy fetcher**: Annoying anti-bot protection? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` with default configuration!
 - **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - All is possible with `PlayWrightFetcher`!
@@ -213,7 +211,7 @@ Scrapling can find elements with more methods and it returns full element `Adapt
 > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
 
 ## Installation
-Scrapling is a breeze to get started with - Starting from version 0.2, we require at least Python 3.8 to work.
+Scrapling is a breeze to get started with - Starting from version 0.2.9, we require at least Python 3.9 to work.
 ```bash
 pip3 install scrapling
 ```
@@ -265,11 +263,11 @@ You might be slightly confused by now so let me clear things up. All fetcher-typ
 ```python
 from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
-All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `
+All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class.
 
 If you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:
 ```python
-from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
+from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
 ```
 then use it right away without initializing like:
 ```python
@@ -282,21 +280,32 @@ Also, the `Response` object returned from all fetchers is the same as the `Adapt
 ### Fetcher
 This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.
 
-For all methods, you have `
+For all methods, you have `stealthy_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default. You can also set the number of retries with the argument `retries` for all methods and this will make httpx retry requests if it failed for any reason. The default number of retries for all `Fetcher` methods is 3.
 
 You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format `http://username:password@localhost:8030`
 ```python
->> page = Fetcher().get('https://httpbin.org/get',
+>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
 >> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
 >> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
 >> page = Fetcher().delete('https://httpbin.org/delete')
 ```
+For Async requests, you will just replace the import like below:
+```python
+>> from scrapling import AsyncFetcher
+>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
+>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
+>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
+>> page = await AsyncFetcher().delete('https://httpbin.org/delete')
+```
 ### StealthyFetcher
 This class is built on top of [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
 ```python
 >> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
 >> page.status == 200
 True
+>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
+>> page.status == 200
+True
 ```
 > Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
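The `retries` argument described for `Fetcher` above never appears in the snippets, so here is a minimal sketch combining it with a proxy through the new `AsyncFetcher`; the values are placeholders, and it assumes `AsyncFetcher.get` accepts the same `stealthy_headers`/`retries`/`proxy` keywords the paragraph describes for `Fetcher`.

```python
import asyncio

from scrapling import AsyncFetcher


async def main():
    # retries and proxy are the arguments described in the Fetcher section above;
    # the URL, retry count, and proxy credentials are illustrative only.
    page = await AsyncFetcher().get(
        'https://httpbin.org/get',
        stealthy_headers=True,
        retries=5,  # the README states the default is 3
        proxy='http://username:password@localhost:8030',
    )
    print(page.status)


asyncio.run(main())
```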
@@ -314,7 +323,8 @@ True
 | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
 | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
 | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
-| allow_webgl |
+| allow_webgl | Enabled by default. Disabling it WebGL not recommended as many WAFs now checks if WebGL is enabled. | ✔️ |
+| geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
 | disable_ads | Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
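As a rough usage sketch for the new `geoip` flag, the snippet below pairs it with a proxy as the description recommends; the proxy URL is a placeholder, and passing `proxy` directly to `fetch` is assumed from the wider README rather than shown in this excerpt.

```python
from scrapling import StealthyFetcher

# geoip=True derives the browser fingerprint (timezone, locale, language,
# spoofed WebRTC IP) from the IP the request exits from, i.e. the proxy below.
page = StealthyFetcher().fetch(
    'https://www.browserscan.net/bot-detection',
    proxy='http://username:password@localhost:8030',  # placeholder credentials
    geoip=True,
)
print(page.status)
```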
@@ -333,6 +343,9 @@ This class is built on top of [Playwright](https://playwright.dev/python/) which
 >> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
 >> page.css_first("#search a::attr(href)")
 'https://github.com/D4Vinci/Scrapling'
+>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch
+>> page.css_first("#search a::attr(href)")
+'https://github.com/D4Vinci/Scrapling'
 ```
 > Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
@@ -437,6 +450,9 @@ You can select elements by their text content in multiple ways, here's a full ex
 >>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
 
+>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href`
+'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'
+
 >>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
 [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
 
@@ -850,7 +866,6 @@ This project includes code adapted from:
 
 ## Known Issues
 - In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone.
-- Currently, Scrapling is not compatible with async/await.
 
 ---
 <div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>

{scrapling-0.2.8 → scrapling-0.2.91}/README.md

@@ -6,7 +6,7 @@ Dealing with failing web scrapers due to anti-bot protections or website changes
 Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
 
 ```python
->> from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
+>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
 # Fetch websites' source under the radar!
 >> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
 >> print(page.status)
@@ -35,7 +35,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 
 ## Table of content
 * [Key Features](#key-features)
-  * [Fetch websites as you prefer](#fetch-websites-as-you-prefer)
+  * [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support)
   * [Adaptive Scraping](#adaptive-scraping)
   * [Performance](#performance)
   * [Developing Experience](#developing-experience)
@@ -76,7 +76,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 
 ## Key Features
 
-### Fetch websites as you prefer
+### Fetch websites as you prefer with async support
 - **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher`
 - **Stealthy fetcher**: Annoying anti-bot protection? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` with default configuration!
 - **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - All is possible with `PlayWrightFetcher`!
@@ -167,7 +167,7 @@ Scrapling can find elements with more methods and it returns full element `Adapt
 > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
 
 ## Installation
-Scrapling is a breeze to get started with - Starting from version 0.2, we require at least Python 3.8 to work.
+Scrapling is a breeze to get started with - Starting from version 0.2.9, we require at least Python 3.9 to work.
 ```bash
 pip3 install scrapling
 ```
@@ -219,11 +219,11 @@ You might be slightly confused by now so let me clear things up. All fetcher-typ
 ```python
 from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
-All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `
+All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class.
 
 If you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:
 ```python
-from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
+from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
 ```
 then use it right away without initializing like:
 ```python
@@ -236,21 +236,32 @@ Also, the `Response` object returned from all fetchers is the same as the `Adapt
 ### Fetcher
 This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.
 
-For all methods, you have `
+For all methods, you have `stealthy_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default. You can also set the number of retries with the argument `retries` for all methods and this will make httpx retry requests if it failed for any reason. The default number of retries for all `Fetcher` methods is 3.
 
 You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format `http://username:password@localhost:8030`
 ```python
->> page = Fetcher().get('https://httpbin.org/get',
+>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
 >> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
 >> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
 >> page = Fetcher().delete('https://httpbin.org/delete')
 ```
+For Async requests, you will just replace the import like below:
+```python
+>> from scrapling import AsyncFetcher
+>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
+>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
+>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
+>> page = await AsyncFetcher().delete('https://httpbin.org/delete')
+```
 ### StealthyFetcher
 This class is built on top of [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
 ```python
 >> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
 >> page.status == 200
 True
+>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
+>> page.status == 200
+True
 ```
 > Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
@@ -268,7 +279,8 @@ True
 | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
 | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
 | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
-| allow_webgl |
+| allow_webgl | Enabled by default. Disabling it WebGL not recommended as many WAFs now checks if WebGL is enabled. | ✔️ |
+| geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
 | disable_ads | Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
@@ -287,6 +299,9 @@ This class is built on top of [Playwright](https://playwright.dev/python/) which
 >> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
 >> page.css_first("#search a::attr(href)")
 'https://github.com/D4Vinci/Scrapling'
+>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch
+>> page.css_first("#search a::attr(href)")
+'https://github.com/D4Vinci/Scrapling'
 ```
 > Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
@@ -391,6 +406,9 @@ You can select elements by their text content in multiple ways, here's a full ex
 >>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
 
+>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href`
+'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'
+
 >>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
 [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
 
@@ -804,7 +822,6 @@ This project includes code adapted from:
 
 ## Known Issues
 - In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone.
-- Currently, Scrapling is not compatible with async/await.
 
 ---
 <div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>

{scrapling-0.2.8 → scrapling-0.2.91}/scrapling/__init__.py

@@ -1,12 +1,12 @@
 # Declare top-level shortcuts
 from scrapling.core.custom_types import AttributesHandler, TextHandler
-from scrapling.fetchers import (CustomFetcher, Fetcher,
-                                StealthyFetcher)
+from scrapling.fetchers import (AsyncFetcher, CustomFetcher, Fetcher,
+                                PlayWrightFetcher, StealthyFetcher)
 from scrapling.parser import Adaptor, Adaptors
 
 __author__ = "Karim Shoair (karim.shoair@pm.me)"
-__version__ = "0.2.8"
+__version__ = "0.2.91"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"
 
 
-__all__ = ['Adaptor', 'Fetcher', 'StealthyFetcher', 'PlayWrightFetcher']
+__all__ = ['Adaptor', 'Fetcher', 'AsyncFetcher', 'StealthyFetcher', 'PlayWrightFetcher']

{scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/_types.py

@@ -5,6 +5,8 @@ Type definitions for type checking purposes.
 from typing import (TYPE_CHECKING, Any, Callable, Dict, Generator, Iterable,
                     List, Literal, Optional, Pattern, Tuple, Type, Union)
 
+SelectorWaitStates = Literal["attached", "detached", "hidden", "visible"]
+
 try:
     from typing import Protocol
 except ImportError:
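The new `SelectorWaitStates` alias is a `Literal` of four wait-state names. The sketch below shows how such an alias is typically consumed, using a hypothetical `wait_for` function (not part of Scrapling's API) so that a type checker rejects any other string.

```python
from typing import Literal

SelectorWaitStates = Literal["attached", "detached", "hidden", "visible"]


# `wait_for` is a hypothetical function used only to illustrate the annotation;
# mypy/pyright will flag any call whose `state` is not one of the four literals.
def wait_for(selector: str, state: SelectorWaitStates = "attached") -> None:
    print(f"waiting until {selector!r} is {state}")


wait_for("#search", state="visible")   # accepted
# wait_for("#search", state="loaded")  # rejected by the type checker
```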

{scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/custom_types.py

@@ -14,11 +14,70 @@ class TextHandler(str):
     __slots__ = ()
 
     def __new__(cls, string):
-
-        if type(string) is str:
+        if isinstance(string, str):
             return super().__new__(cls, string)
-
-
+        return super().__new__(cls, '')
+
+    # Make methods from original `str` class return `TextHandler` instead of returning `str` again
+    # Of course, this stupid workaround is only so we can keep the auto-completion working without issues in your IDE
+    # and I made sonnet write it for me :)
+    def strip(self, chars=None):
+        return TextHandler(super().strip(chars))
+
+    def lstrip(self, chars=None):
+        return TextHandler(super().lstrip(chars))
+
+    def rstrip(self, chars=None):
+        return TextHandler(super().rstrip(chars))
+
+    def capitalize(self):
+        return TextHandler(super().capitalize())
+
+    def casefold(self):
+        return TextHandler(super().casefold())
+
+    def center(self, width, fillchar=' '):
+        return TextHandler(super().center(width, fillchar))
+
+    def expandtabs(self, tabsize=8):
+        return TextHandler(super().expandtabs(tabsize))
+
+    def format(self, *args, **kwargs):
+        return TextHandler(super().format(*args, **kwargs))
+
+    def format_map(self, mapping):
+        return TextHandler(super().format_map(mapping))
+
+    def join(self, iterable):
+        return TextHandler(super().join(iterable))
+
+    def ljust(self, width, fillchar=' '):
+        return TextHandler(super().ljust(width, fillchar))
+
+    def rjust(self, width, fillchar=' '):
+        return TextHandler(super().rjust(width, fillchar))
+
+    def swapcase(self):
+        return TextHandler(super().swapcase())
+
+    def title(self):
+        return TextHandler(super().title())
+
+    def translate(self, table):
+        return TextHandler(super().translate(table))
+
+    def zfill(self, width):
+        return TextHandler(super().zfill(width))
+
+    def replace(self, old, new, count=-1):
+        return TextHandler(super().replace(old, new, count))
+
+    def upper(self):
+        return TextHandler(super().upper())
+
+    def lower(self):
+        return TextHandler(super().lower())
+    ##############
 
     def sort(self, reverse: bool = False) -> str:
         """Return a sorted version of the string"""
@@ -30,11 +89,21 @@ class TextHandler(str):
         data = re.sub(' +', ' ', data)
         return self.__class__(data.strip())
 
+    # For easy copy-paste from Scrapy/parsel code when needed :)
+    def get(self, default=None):
+        return self
+
+    def get_all(self):
+        return self
+
+    extract = get_all
+    extract_first = get
+
     def json(self) -> Dict:
         """Return json response if the response is jsonable otherwise throw error"""
-        # Using
+        # Using str function as a workaround for orjson issue with subclasses of str
         # Check this out: https://github.com/ijl/orjson/issues/445
-        return loads(self)
+        return loads(str(self))
 
     def re(
         self, regex: Union[str, Pattern[str]], replace_entities: bool = True, clean_match: bool = False,
@@ -127,6 +196,19 @@ class TextHandlers(List[TextHandler]):
             return result
         return default
 
+    # For easy copy-paste from Scrapy/parsel code when needed :)
+    def get(self, default=None):
+        """Returns the first item of the current list
+        :param default: the default value to return if the current list is empty
+        """
+        return self[0] if len(self) > 0 else default
+
+    def extract(self):
+        return self
+
+    extract_first = get
+    get_all = extract
+
 
 class AttributesHandler(Mapping):
     """A read-only mapping to use instead of the standard dictionary for the speed boost but at the same time I use it to add more functionalities.
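Per the comments, the `get`/`get_all`/`extract`/`extract_first` aliases added in these two hunks are there so parsel/Scrapy-style snippets can be pasted in with little change. A rough sketch under that assumption; the URL and selector are illustrative only.

```python
from scrapling import Fetcher

page = Fetcher().get('https://quotes.toscrape.com/')

# parsel/Scrapy-style access now works thanks to the aliases above
quotes = page.css('.quote .text::text')
first = quotes.get()           # first item, or the supplied default if the list is empty
everything = quotes.extract()  # the list itself, parsel-style
```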

{scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/storage_adaptors.py

@@ -1,4 +1,3 @@
-import logging
 import sqlite3
 import threading
 from abc import ABC, abstractmethod
@@ -9,7 +8,7 @@ from lxml import html
 from tldextract import extract as tld
 
 from scrapling.core._types import Dict, Optional, Union
-from scrapling.core.utils import _StorageTools,
+from scrapling.core.utils import _StorageTools, log, lru_cache
 
 
 class StorageSystemMixin(ABC):
@@ -20,7 +19,7 @@ class StorageSystemMixin(ABC):
         """
         self.url = url
 
-    @
+    @lru_cache(None, typed=True)
     def _get_base_url(self, default_value: str = 'default') -> str:
         if not self.url or type(self.url) is not str:
             return default_value
@@ -52,7 +51,7 @@ class StorageSystemMixin(ABC):
         raise NotImplementedError('Storage system must implement `save` method')
 
     @staticmethod
-    @
+    @lru_cache(None, typed=True)
     def _get_hash(identifier: str) -> str:
         """If you want to hash identifier in your storage system, use this safer"""
         identifier = identifier.lower().strip()
@@ -64,7 +63,7 @@ class StorageSystemMixin(ABC):
         return f"{hash_value}_{len(identifier)}" # Length to reduce collision chance
 
 
-@
+@lru_cache(None, typed=True)
 class SQLiteStorageSystem(StorageSystemMixin):
     """The recommended system to use, it's race condition safe and thread safe.
     Mainly built so the library can run in threaded frameworks like scrapy or threaded tools
@@ -86,7 +85,7 @@ class SQLiteStorageSystem(StorageSystemMixin):
         self.connection.execute("PRAGMA journal_mode=WAL")
         self.cursor = self.connection.cursor()
         self._setup_database()
-
+        log.debug(
             f'Storage system loaded with arguments (storage_file="{storage_file}", url="{url}")'
         )
 
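Putting `lru_cache(None, typed=True)` on top of a class, as done to `SQLiteStorageSystem` above, caches the constructor call itself, which is the cheap Singleton trick the comment in `utils.py` below alludes to. A minimal sketch with a throwaway class (not Scrapling code):

```python
from functools import lru_cache


@lru_cache(None, typed=True)
class Connection:
    """Throwaway example; imagine an expensive sqlite3.connect() happening in __init__."""

    def __init__(self, path: str):
        self.path = path


a = Connection("storage.db")
b = Connection("storage.db")
assert a is b                            # identical arguments reuse the cached instance
assert a is not Connection("other.db")   # different arguments build a new one
```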

{scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/translator.py

@@ -17,7 +17,7 @@ from cssselect.xpath import XPathExpr as OriginalXPathExpr
 from w3lib.html import HTML5_WHITESPACE
 
 from scrapling.core._types import Any, Optional, Protocol, Self
-from scrapling.core.utils import
+from scrapling.core.utils import lru_cache
 
 regex = f"[{HTML5_WHITESPACE}]+"
 replace_html5_whitespaces = re.compile(regex).sub
@@ -139,6 +139,6 @@ class TranslatorMixin:
 
 
 class HTMLTranslator(TranslatorMixin, OriginalHTMLTranslator):
-    @
+    @lru_cache(maxsize=256)
     def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
         return super().css_to_xpath(css, prefix)

{scrapling-0.2.8 → scrapling-0.2.91}/scrapling/core/utils.py

@@ -9,17 +9,36 @@ from scrapling.core._types import Any, Dict, Iterable, Union
 
 # Using cache on top of a class is brilliant way to achieve Singleton design pattern without much code
 # functools.cache is available on Python 3.9+ only so let's keep lru_cache
-from functools import lru_cache
-
+from functools import lru_cache # isort:skip
 
 html_forbidden = {html.HtmlComment, }
-
-
-
-
-
-
-
+
+
+@lru_cache(1, typed=True)
+def setup_logger():
+    """Create and configure a logger with a standard format.
+
+    :returns: logging.Logger: Configured logger instance
+    """
+    logger = logging.getLogger('scrapling')
+    logger.setLevel(logging.INFO)
+
+    formatter = logging.Formatter(
+        fmt="[%(asctime)s] %(levelname)s: %(message)s",
+        datefmt="%Y-%m-%d %H:%M:%S"
+    )
+
+    console_handler = logging.StreamHandler()
+    console_handler.setFormatter(formatter)
+
+    # Add handler to logger (if not already added)
+    if not logger.handlers:
+        logger.addHandler(console_handler)
+
+    return logger
+
+
+log = setup_logger()
 
 
 def is_jsonable(content: Union[bytes, str]) -> bool:
@@ -33,23 +52,6 @@ def is_jsonable(content: Union[bytes, str]) -> bool:
     return False
 
 
-@cache(None, typed=True)
-def setup_basic_logging(level: str = 'debug'):
-    levels = {
-        'debug': logging.DEBUG,
-        'info': logging.INFO,
-        'warning': logging.WARNING,
-        'error': logging.ERROR,
-        'critical': logging.CRITICAL
-    }
-    formatter = logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s", "%Y-%m-%d %H:%M:%S")
-    lvl = levels[level.lower()]
-    handler = logging.StreamHandler()
-    handler.setFormatter(formatter)
-    # Configure the root logger
-    logging.basicConfig(level=lvl, handlers=[handler])
-
-
 def flatten(lst: Iterable):
     return list(chain.from_iterable(lst))
 
@@ -113,7 +115,7 @@ class _StorageTools:
     # return _impl
 
 
-@
+@lru_cache(None, typed=True)
 def clean_spaces(string):
     string = string.replace('\t', ' ')
     string = re.sub('[\n|\r]', '', string)

{scrapling-0.2.8 → scrapling-0.2.91}/scrapling/defaults.py

@@ -1,6 +1,7 @@
-from .fetchers import Fetcher, PlayWrightFetcher, StealthyFetcher
+from .fetchers import AsyncFetcher, Fetcher, PlayWrightFetcher, StealthyFetcher
 
 # If you are going to use Fetchers with the default settings, import them from this file instead for a cleaner looking code
 Fetcher = Fetcher()
+AsyncFetcher = AsyncFetcher()
 StealthyFetcher = StealthyFetcher()
 PlayWrightFetcher = PlayWrightFetcher()