scrapling 0.2.6.tar.gz → 0.2.8.tar.gz
- {scrapling-0.2.6/scrapling.egg-info → scrapling-0.2.8}/PKG-INFO +28 -21
- {scrapling-0.2.6 → scrapling-0.2.8}/README.md +25 -18
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/__init__.py +4 -3
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/_types.py +2 -3
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/custom_types.py +5 -5
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/translator.py +5 -6
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/utils.py +15 -12
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/defaults.py +1 -1
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/camo.py +20 -13
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/constants.py +1 -1
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/pw.py +31 -18
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/static.py +24 -11
- scrapling-0.2.8/scrapling/engines/toolbelt/__init__.py +6 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/custom.py +15 -10
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/fingerprints.py +5 -5
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/navigation.py +6 -6
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/fetchers.py +23 -14
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/parser.py +15 -8
- {scrapling-0.2.6 → scrapling-0.2.8/scrapling.egg-info}/PKG-INFO +28 -21
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling.egg-info/requires.txt +1 -1
- {scrapling-0.2.6 → scrapling-0.2.8}/setup.cfg +1 -1
- {scrapling-0.2.6 → scrapling-0.2.8}/setup.py +6 -6
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/fetchers/test_camoufox.py +1 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/fetchers/test_httpx.py +1 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/fetchers/test_playwright.py +1 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/parser/test_general.py +3 -1
- scrapling-0.2.6/scrapling/engines/toolbelt/__init__.py +0 -20
- {scrapling-0.2.6 → scrapling-0.2.8}/LICENSE +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/MANIFEST.in +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/__init__.py +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/mixins.py +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/storage_adaptors.py +6 -6
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/__init__.py +2 -2
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/bypasses/pdf_viewer.js +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling/py.typed +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling.egg-info/SOURCES.txt +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/scrapling.egg-info/top_level.txt +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/__init__.py +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/fetchers/__init__.py +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/fetchers/test_utils.py +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/parser/__init__.py +0 -0
- {scrapling-0.2.6 → scrapling-0.2.8}/tests/parser/test_automatch.py +0 -0
{scrapling-0.2.6/scrapling.egg-info → scrapling-0.2.8}/PKG-INFO
@@ -1,7 +1,7 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.2.6
-Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
+Version: 0.2.8
+Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
 Author-email: karim.shoair@pm.me
@@ -41,7 +41,7 @@ Requires-Dist: tldextract
 Requires-Dist: httpx[brotli,zstd]
 Requires-Dist: playwright==1.48
 Requires-Dist: rebrowser-playwright
-Requires-Dist: camoufox>=0.
+Requires-Dist: camoufox>=0.4.4
 Requires-Dist: browserforge
 
 # 🕷️ Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
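A quick post-upgrade sanity check of the pins above (a sketch; it assumes scrapling 0.2.8 and its dependencies are already installed in the current environment):

```python
from importlib.metadata import version

print(version('scrapling'))   # expected: 0.2.8
print(version('playwright'))  # pinned to 1.48 by this release
print(version('camoufox'))    # the floor is now >=0.4.4
```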
@@ -52,7 +52,7 @@ Dealing with failing web scrapers due to anti-bot protections or website changes
 Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
 
 ```python
->> from scrapling.
+>> from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
 # Fetch websites' source under the radar!
 >> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
 >> print(page.status)
@@ -90,10 +90,11 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 * [Text Extraction Speed Test (5000 nested elements).](#text-extraction-speed-test-5000-nested-elements)
 * [Extraction By Text Speed Test](#extraction-by-text-speed-test)
 * [Installation](#installation)
-* [Fetching Websites
-* [
-* [
-* [
+* [Fetching Websites](#fetching-websites)
+  * [Features](#features)
+  * [Fetcher class](#fetcher)
+  * [StealthyFetcher class](#stealthyfetcher)
+  * [PlayWrightFetcher class](#playwrightfetcher)
 * [Advanced Parsing Features](#advanced-parsing-features)
   * [Smart Navigation](#smart-navigation)
   * [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
@@ -256,43 +257,48 @@ playwright install chromium
 python -m browserforge update
 ```
 
-## Fetching Websites
-
+## Fetching Websites
+Fetchers are basically interfaces that do requests or fetch pages for you in a single request fashion and then return an `Adaptor` object for you. This feature was introduced because the only option we had before was to fetch the page as you wanted it, then pass it manually to the `Adaptor` class to create an `Adaptor` instance and start playing around with the page.
+
+### Features
+You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way
 ```python
 from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
-
+All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug`, which are the same ones you give to the `Adaptor` class.
 
 If you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:
 ```python
-from scrapling.
+from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
 then use it right away without initializing like:
 ```python
 page = StealthyFetcher.fetch('https://example.com')
 ```
 
-Also, the `Response` object returned from all fetchers is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All `cookies`, `headers`, and `request_headers` are always of type `dictionary`.
+Also, the `Response` object returned from all fetchers is the same as the `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All `cookies`, `headers`, and `request_headers` are always of type `dictionary`.
 > [!NOTE]
 > The `auto_match` argument is enabled by default which is the one you should care about the most as you will see later.
 ### Fetcher
 This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.
 
 For all methods, you have `stealth_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default.
+
+You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format `http://username:password@localhost:8030`
 ```python
 >> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
->> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
+>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
 >> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
 >> page = Fetcher().delete('https://httpbin.org/delete')
 ```
 ### StealthyFetcher
-This class is built on top of [Camoufox](https://github.com/daijro/camoufox)
+This class is built on top of [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
 ```python
 >> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
 >> page.status == 200
 True
 ```
-> Note: all requests done by this fetcher
+> Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
 <details><summary><strong>For the sake of simplicity, expand this for the complete list of arguments</strong></summary>
 
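To illustrate the new `proxy` argument together with the `Response` attributes described above, a minimal sketch (the proxy URL and credentials are placeholders):

```python
from scrapling import Fetcher

# Placeholder proxy; GET, POST, PUT, and DELETE all accept the same argument
page = Fetcher().get('https://httpbin.org/get', proxy='http://username:password@localhost:8030')

print(page.status, page.reason)    # e.g. 200 OK
print(page.cookies, page.headers)  # always plain dictionaries
print(page.request_headers)        # the headers Scrapling actually sent
```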
@@ -309,6 +315,7 @@ True
 | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
 | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
 | allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
+| disable_ads | Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
 | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
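A small sketch of the new `disable_ads` toggle from the row above; it defaults to True (uBlock Origin installed), so passing False opts out:

```python
from scrapling import StealthyFetcher

# disable_ads=True is the default; False skips installing uBlock Origin
page = StealthyFetcher().fetch('https://example.com', disable_ads=False)
print(page.status)
```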
@@ -327,7 +334,7 @@ This class is built on top of [Playwright](https://playwright.dev/python/) which
 >> page.css_first("#search a::attr(href)")
 'https://github.com/D4Vinci/Scrapling'
 ```
-> Note: all requests done by this fetcher
+> Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
 Using this Fetcher class, you can make requests with:
 1) Vanilla Playwright without any modifications other than the ones you chose.
@@ -339,7 +346,7 @@ Using this Fetcher class, you can make requests with:
 3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
 4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
 
-> Hence using the `real_chrome` argument requires that you have
+> Hence using the `real_chrome` argument requires that you have Chrome browser installed on your device
 
 Add that to a lot of controlling/hiding options as you will see in the arguments list below.
 
@@ -362,7 +369,8 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
 | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
 | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
 | stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
-| real_chrome | If you have
+| real_chrome | If you have Chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
+| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
 | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
 | nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | ✔️ |
 | nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
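The two new rows translate directly to fetch-time arguments; a sketch (assumes Chrome is installed locally when `real_chrome=True`):

```python
from scrapling import PlayWrightFetcher

# locale is new in 0.2.8; real_chrome drives your locally installed Chrome
page = PlayWrightFetcher().fetch('https://example.com', locale='de-DE', real_chrome=True)
print(page.status)
```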
@@ -814,8 +822,7 @@ Of course, you can find elements by text/regex, find similar elements in a more
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its state.
 
 ## More Sponsors!
-
-<a href="https://serpapi.com/?utm_source=scrapling"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png" height="500" width="500" alt="SerpApi Banner" ></a>
+<a href="https://serpapi.com/?utm_source=scrapling"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png" height="500" alt="SerpApi Banner" ></a>
 
 
 ## Contributing
{scrapling-0.2.6 → scrapling-0.2.8}/README.md
@@ -6,7 +6,7 @@ Dealing with failing web scrapers due to anti-bot protections or website changes
 Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
 
 ```python
->> from scrapling.
+>> from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
 # Fetch websites' source under the radar!
 >> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
 >> print(page.status)
@@ -44,10 +44,11 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 * [Text Extraction Speed Test (5000 nested elements).](#text-extraction-speed-test-5000-nested-elements)
 * [Extraction By Text Speed Test](#extraction-by-text-speed-test)
 * [Installation](#installation)
-* [Fetching Websites
-* [
-* [
-* [
+* [Fetching Websites](#fetching-websites)
+  * [Features](#features)
+  * [Fetcher class](#fetcher)
+  * [StealthyFetcher class](#stealthyfetcher)
+  * [PlayWrightFetcher class](#playwrightfetcher)
 * [Advanced Parsing Features](#advanced-parsing-features)
   * [Smart Navigation](#smart-navigation)
   * [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
@@ -210,43 +211,48 @@ playwright install chromium
 python -m browserforge update
 ```
 
-## Fetching Websites
-
+## Fetching Websites
+Fetchers are basically interfaces that do requests or fetch pages for you in a single request fashion and then return an `Adaptor` object for you. This feature was introduced because the only option we had before was to fetch the page as you wanted it, then pass it manually to the `Adaptor` class to create an `Adaptor` instance and start playing around with the page.
+
+### Features
+You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way
 ```python
 from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
-
+All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug`, which are the same ones you give to the `Adaptor` class.
 
 If you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:
 ```python
-from scrapling.
+from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
 then use it right away without initializing like:
 ```python
 page = StealthyFetcher.fetch('https://example.com')
 ```
 
-Also, the `Response` object returned from all fetchers is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All `cookies`, `headers`, and `request_headers` are always of type `dictionary`.
+Also, the `Response` object returned from all fetchers is the same as the `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All `cookies`, `headers`, and `request_headers` are always of type `dictionary`.
 > [!NOTE]
 > The `auto_match` argument is enabled by default which is the one you should care about the most as you will see later.
 ### Fetcher
 This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.
 
 For all methods, you have `stealth_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default.
+
+You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format `http://username:password@localhost:8030`
 ```python
 >> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
->> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
+>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
 >> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
 >> page = Fetcher().delete('https://httpbin.org/delete')
 ```
 ### StealthyFetcher
-This class is built on top of [Camoufox](https://github.com/daijro/camoufox)
+This class is built on top of [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
 ```python
 >> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
 >> page.status == 200
 True
 ```
-> Note: all requests done by this fetcher
+> Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
 <details><summary><strong>For the sake of simplicity, expand this for the complete list of arguments</strong></summary>
 
@@ -263,6 +269,7 @@ True
 | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
 | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
 | allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
+| disable_ads | Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
 | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
@@ -281,7 +288,7 @@ This class is built on top of [Playwright](https://playwright.dev/python/) which
 >> page.css_first("#search a::attr(href)")
 'https://github.com/D4Vinci/Scrapling'
 ```
-> Note: all requests done by this fetcher
+> Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
 
 Using this Fetcher class, you can make requests with:
 1) Vanilla Playwright without any modifications other than the ones you chose.
@@ -293,7 +300,7 @@ Using this Fetcher class, you can make requests with:
 3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
 4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
 
-> Hence using the `real_chrome` argument requires that you have
+> Hence using the `real_chrome` argument requires that you have Chrome browser installed on your device
 
 Add that to a lot of controlling/hiding options as you will see in the arguments list below.
 
@@ -316,7 +323,8 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
 | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
 | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
 | stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
-| real_chrome | If you have
+| real_chrome | If you have Chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
+| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
 | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
 | nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | ✔️ |
 | nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
@@ -768,8 +776,7 @@ Of course, you can find elements by text/regex, find similar elements in a more
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its state.
 
 ## More Sponsors!
-
-<a href="https://serpapi.com/?utm_source=scrapling"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png" height="500" width="500" alt="SerpApi Banner" ></a>
+<a href="https://serpapi.com/?utm_source=scrapling"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png" height="500" alt="SerpApi Banner" ></a>
 
 
 ## Contributing
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/__init__.py
@@ -1,10 +1,11 @@
 # Declare top-level shortcuts
-from scrapling.
+from scrapling.core.custom_types import AttributesHandler, TextHandler
+from scrapling.fetchers import (CustomFetcher, Fetcher, PlayWrightFetcher,
+                                StealthyFetcher)
 from scrapling.parser import Adaptor, Adaptors
-from scrapling.core.custom_types import TextHandler, AttributesHandler
 
 __author__ = "Karim Shoair (karim.shoair@pm.me)"
-__version__ = "0.2.6"
+__version__ = "0.2.8"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"
 
 
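The reshuffled imports keep the same public surface; a quick check of what the package re-exports after the upgrade (a sketch, assuming 0.2.8 is installed):

```python
import scrapling
from scrapling import Adaptor, Fetcher, PlayWrightFetcher, StealthyFetcher
from scrapling import AttributesHandler, TextHandler

print(scrapling.__version__)  # '0.2.8'
```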
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/_types.py
@@ -2,9 +2,8 @@
 Type definitions for type checking purposes.
 """
 
-from typing import (
-
-)
+from typing import (TYPE_CHECKING, Any, Callable, Dict, Generator, Iterable,
+                    List, Literal, Optional, Pattern, Tuple, Type, Union)
 
 try:
     from typing import Protocol
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/custom_types.py
@@ -1,13 +1,13 @@
 import re
-from types import MappingProxyType
 from collections.abc import Mapping
+from types import MappingProxyType
 
-from
-from scrapling.core._types import Dict, List, Union, Pattern, SupportsIndex
-
-from orjson import loads, dumps
+from orjson import dumps, loads
 from w3lib.html import replace_entities as _replace_entities
 
+from scrapling.core._types import Dict, List, Pattern, SupportsIndex, Union
+from scrapling.core.utils import _is_iterable, flatten
+
 
 class TextHandler(str):
     """Extends standard Python string by adding more functionality"""
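`TextHandler`, whose imports this hunk reorders, subclasses `str`; a minimal sketch of that relationship (its extra helpers aren't shown in this hunk):

```python
from scrapling.core.custom_types import TextHandler

text = TextHandler('Scrapling 0.2.8')
print(isinstance(text, str))  # True — every regular str method still applies
print(text.upper())           # SCRAPLING 0.2.8
```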
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/translator.py
@@ -10,15 +10,14 @@ So you don't have to learn a new selectors/api method like what bs4 done with so
 
 import re
 
-from w3lib.html import HTML5_WHITESPACE
-from scrapling.core.utils import cache
-from scrapling.core._types import Any, Optional, Protocol, Self
-
-from cssselect.xpath import ExpressionError
-from cssselect.xpath import XPathExpr as OriginalXPathExpr
 from cssselect import HTMLTranslator as OriginalHTMLTranslator
 from cssselect.parser import Element, FunctionalPseudoElement, PseudoElement
+from cssselect.xpath import ExpressionError
+from cssselect.xpath import XPathExpr as OriginalXPathExpr
+from w3lib.html import HTML5_WHITESPACE
 
+from scrapling.core._types import Any, Optional, Protocol, Self
+from scrapling.core.utils import cache
 
 regex = f"[{HTML5_WHITESPACE}]+"
 replace_html5_whitespaces = re.compile(regex).sub
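For context on what this module builds on: plain `cssselect` translates CSS selectors into XPath, and scrapling's translator extends that machinery. A sketch using vanilla cssselect:

```python
from cssselect import HTMLTranslator

# The base translation scrapling's translator subclasses and extends
print(HTMLTranslator().css_to_xpath('#search a'))
# roughly: descendant-or-self::*[@id = 'search']/descendant-or-self::*/a
```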
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/core/utils.py
@@ -1,22 +1,25 @@
-import re
 import logging
+import re
 from itertools import chain
-# Using cache on top of a class is brilliant way to achieve Singleton design pattern without much code
-from functools import lru_cache as cache  # functools.cache is available on Python 3.9+ only so let's keep lru_cache
-
-from scrapling.core._types import Dict, Iterable, Any, Union
 
 import orjson
 from lxml import html
 
+from scrapling.core._types import Any, Dict, Iterable, Union
+
+# Using cache on top of a class is brilliant way to achieve Singleton design pattern without much code
+# functools.cache is available on Python 3.9+ only so let's keep lru_cache
+from functools import lru_cache as cache  # isort:skip
+
+
 html_forbidden = {html.HtmlComment, }
 logging.basicConfig(
-
-
-
-
-
-
+    level=logging.ERROR,
+    format='%(asctime)s - %(levelname)s - %(message)s',
+    handlers=[
+        logging.StreamHandler()
+    ]
+)
 
 
 def is_jsonable(content: Union[bytes, str]) -> bool:
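The comment kept above refers to a small trick: wrapping a class in `lru_cache` makes every call return the one cached instance. A standalone illustration:

```python
from functools import lru_cache as cache


@cache(maxsize=None)
class Singleton:
    """Each no-argument call hits the cache and returns the same object."""


print(Singleton() is Singleton())  # True
```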
@@ -94,7 +97,7 @@ class _StorageTools:
         parent = element.getparent()
         return tuple(
             (element.tag,) if parent is None else (
-
+                cls._get_element_path(parent) + (element.tag,)
             )
         )
 
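The repaired line completes a simple recursion; a standalone sketch of what `_get_element_path` computes (the helper name here is illustrative):

```python
from lxml import html


def get_element_path(element):
    # Mirrors the hunk above: build the tag chain from the root down
    parent = element.getparent()
    return (element.tag,) if parent is None else (
        get_element_path(parent) + (element.tag,)
    )


tree = html.fromstring('<html><body><div><a href="/">x</a></div></body></html>')
print(get_element_path(tree.xpath('//a')[0]))  # ('html', 'body', 'div', 'a')
```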
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/defaults.py
@@ -1,4 +1,4 @@
-from .fetchers import Fetcher,
+from .fetchers import Fetcher, PlayWrightFetcher, StealthyFetcher
 
 # If you are going to use Fetchers with the default settings, import them from this file instead for a cleaner looking code
 Fetcher = Fetcher()
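With `defaults.py` now exporting all three pre-initialized fetchers, they can be used without instantiation, as the README sections above describe:

```python
from scrapling.defaults import Fetcher, StealthyFetcher

page = Fetcher.get('https://httpbin.org/get')  # no Fetcher() call needed
stealth_page = StealthyFetcher.fetch('https://example.com')
print(page.status, stealth_page.status)
```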
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/camo.py
@@ -1,19 +1,16 @@
 import logging
-from scrapling.core._types import Union, Callable, Optional, Dict, List, Literal
-
-from scrapling.engines.toolbelt import (
-    Response,
-    do_nothing,
-    StatusText,
-    get_os_name,
-    intercept_route,
-    check_type_validity,
-    construct_proxy_dict,
-    generate_convincing_referer,
-)
 
+from camoufox import DefaultAddons
 from camoufox.sync_api import Camoufox
 
+from scrapling.core._types import (Callable, Dict, List, Literal, Optional,
+                                   Union)
+from scrapling.engines.toolbelt import (Response, StatusText,
+                                        check_type_validity,
+                                        construct_proxy_dict, do_nothing,
+                                        generate_convincing_referer,
+                                        get_os_name, intercept_route)
+
 
 class CamoufoxEngine:
     def __init__(
@@ -21,7 +18,8 @@ class CamoufoxEngine:
             block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = False, network_idle: Optional[bool] = False, humanize: Optional[Union[bool, float]] = True,
             timeout: Optional[float] = 30000, page_action: Callable = do_nothing, wait_selector: Optional[str] = None, addons: Optional[List[str]] = None,
             wait_selector_state: str = 'attached', google_search: Optional[bool] = True, extra_headers: Optional[Dict[str, str]] = None,
-            proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None,
+            proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True,
+            adaptor_arguments: Dict = None,
     ):
         """An engine that utilizes Camoufox library, check the `StealthyFetcher` class for more documentation.
@@ -36,6 +34,7 @@ class CamoufoxEngine:
         :param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
         :param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases.
         :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
+        :param disable_ads: Enabled by default, this installs `uBlock Origin` addon on the browser if enabled.
         :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
         :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
         :param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
@@ -54,6 +53,7 @@ class CamoufoxEngine:
         self.network_idle = bool(network_idle)
         self.google_search = bool(google_search)
         self.os_randomize = bool(os_randomize)
+        self.disable_ads = bool(disable_ads)
         self.extra_headers = extra_headers or {}
         self.proxy = construct_proxy_dict(proxy)
         self.addons = addons or []
@@ -75,9 +75,11 @@ class CamoufoxEngine:
         :param url: Target url.
         :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
         """
+        addons = [] if self.disable_ads else [DefaultAddons.UBO]
         with Camoufox(
             proxy=self.proxy,
             addons=self.addons,
+            exclude_addons=addons,
             headless=self.headless,
             humanize=self.humanize,
             i_know_what_im_doing=True,  # To turn warnings off with the user configurations
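Spelled out, the two added lines wire `disable_ads` to Camoufox's `exclude_addons` (a sketch of the same logic):

```python
from camoufox import DefaultAddons

disable_ads = True  # Scrapling's default
exclude = [] if disable_ads else [DefaultAddons.UBO]
# disable_ads=True  -> exclude nothing, Camoufox keeps its bundled uBlock Origin
# disable_ads=False -> the UBO default addon is stripped from the launch
print(exclude)
```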
@@ -105,6 +107,11 @@ class CamoufoxEngine:
             if self.wait_selector and type(self.wait_selector) is str:
                 waiter = page.locator(self.wait_selector)
                 waiter.first.wait_for(state=self.wait_selector_state)
+                # Wait again after waiting for the selector, helpful with protections like Cloudflare
+                page.wait_for_load_state(state="load")
+                page.wait_for_load_state(state="domcontentloaded")
+                if self.network_idle:
+                    page.wait_for_load_state('networkidle')
 
             # This will be parsed inside `Response`
             encoding = res.headers.get('content-type', '') or 'utf-8'  # default encoding
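From the caller's side, the re-waiting added above kicks in whenever `wait_selector` is used; a sketch with a hypothetical target and selector:

```python
from scrapling import StealthyFetcher

# After 'h1' reaches the requested state, 0.2.8 re-waits for load and
# domcontentloaded (plus networkidle when network_idle=True)
page = StealthyFetcher().fetch(
    'https://example.com',
    wait_selector='h1',
    wait_selector_state='visible',
    network_idle=True,
)
print(page.status)
```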
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/constants.py
@@ -44,7 +44,7 @@ DEFAULT_STEALTH_FLAGS = [
     '--disable-default-apps',
     '--disable-print-preview',
     '--disable-dev-shm-usage',
-    '--disable-popup-blocking',
+    # '--disable-popup-blocking',
     '--metrics-recording-only',
     '--disable-crash-reporter',
     '--disable-partial-raster',
{scrapling-0.2.6 → scrapling-0.2.8}/scrapling/engines/pw.py
@@ -1,20 +1,15 @@
 import json
 import logging
-
-
-from scrapling.engines.constants import DEFAULT_STEALTH_FLAGS,
-
-
-
-
-
-
-
-    construct_cdp_url,
-    check_type_validity,
-    construct_proxy_dict,
-    generate_convincing_referer,
-)
+
+from scrapling.core._types import Callable, Dict, List, Optional, Union
+from scrapling.engines.constants import (DEFAULT_STEALTH_FLAGS,
+                                         NSTBROWSER_DEFAULT_QUERY)
+from scrapling.engines.toolbelt import (Response, StatusText,
+                                        check_type_validity, construct_cdp_url,
+                                        construct_proxy_dict, do_nothing,
+                                        generate_convincing_referer,
+                                        generate_headers, intercept_route,
+                                        js_bypass_path)
 
 
 class PlaywrightEngine:
@@ -26,6 +21,7 @@ class PlaywrightEngine:
             timeout: Optional[float] = 30000,
             page_action: Callable = do_nothing,
             wait_selector: Optional[str] = None,
+            locale: Optional[str] = 'en-US',
             wait_selector_state: Optional[str] = 'attached',
             stealth: Optional[bool] = False,
             real_chrome: Optional[bool] = False,
@@ -50,6 +46,7 @@ class PlaywrightEngine:
         :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
         :param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
         :param wait_selector: Wait for a specific css selector to be in a specific state.
+        :param locale: Set the locale for the browser if wanted. The default value is `en-US`.
         :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
         :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
         :param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
@@ -64,6 +61,7 @@ class PlaywrightEngine:
         :param adaptor_arguments: The arguments that will be passed in the end while creating the final Adaptor's class.
         """
         self.headless = headless
+        self.locale = check_type_validity(locale, [str], 'en-US', param_name='locale')
         self.disable_resources = disable_resources
         self.network_idle = bool(network_idle)
         self.stealth = bool(stealth)
@@ -87,6 +85,14 @@ class PlaywrightEngine:
         self.nstbrowser_mode = bool(nstbrowser_mode)
         self.nstbrowser_config = nstbrowser_config
         self.adaptor_arguments = adaptor_arguments if adaptor_arguments else {}
+        self.harmful_default_args = [
+            # This will be ignored to avoid detection more and possibly avoid the popup crashing bug abuse: https://issues.chromium.org/issues/340836884
+            '--enable-automation',
+            '--disable-popup-blocking',
+            # '--disable-component-update',
+            # '--disable-default-apps',
+            # '--disable-extensions',
+        ]
 
     def _cdp_url_logic(self, flags: Optional[List] = None) -> str:
         """Constructs new CDP URL if NSTBrowser is enabled otherwise return CDP URL as it is
@@ -151,15 +157,15 @@ class PlaywrightEngine:
         else:
             if self.stealth:
                 browser = p.chromium.launch(
-                    headless=self.headless, args=flags, ignore_default_args=
+                    headless=self.headless, args=flags, ignore_default_args=self.harmful_default_args, chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
                 )
             else:
-                browser = p.chromium.launch(headless=self.headless, ignore_default_args=
+                browser = p.chromium.launch(headless=self.headless, ignore_default_args=self.harmful_default_args, channel='chrome' if self.real_chrome else 'chromium')
 
             # Creating the context
             if self.stealth:
                 context = browser.new_context(
-                    locale=
+                    locale=self.locale,
                     is_mobile=False,
                     has_touch=False,
                     proxy=self.proxy,
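For reference, `ignore_default_args` is a standard Playwright launch option; a plain-Playwright sketch of the mechanism (not Scrapling's own code, and the flag list is abbreviated):

```python
from playwright.sync_api import sync_playwright

harmful_default_args = ['--enable-automation', '--disable-popup-blocking']
with sync_playwright() as p:
    # Chromium starts without these default switches, trimming obvious automation signals
    browser = p.chromium.launch(headless=True, ignore_default_args=harmful_default_args)
    page = browser.new_page()
    page.goto('https://example.com')
    print(page.title())
    browser.close()
```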
@@ -176,6 +182,8 @@ class PlaywrightEngine:
                 )
             else:
                 context = browser.new_context(
+                    locale=self.locale,
+                    proxy=self.proxy,
                     color_scheme='dark',
                     user_agent=useragent,
                     device_scale_factor=2,
@@ -221,6 +229,11 @@ class PlaywrightEngine:
             if self.wait_selector and type(self.wait_selector) is str:
                 waiter = page.locator(self.wait_selector)
                 waiter.first.wait_for(state=self.wait_selector_state)
+                # Wait again after waiting for the selector, helpful with protections like Cloudflare
+                page.wait_for_load_state(state="load")
+                page.wait_for_load_state(state="domcontentloaded")
+                if self.network_idle:
+                    page.wait_for_load_state('networkidle')
 
             # This will be parsed inside `Response`
             encoding = res.headers.get('content-type', '') or 'utf-8'  # default encoding