PyPI - scrapling - Versions diffs - 0.2.5__tar.gz → 0.2.6__tar.gz - Mend

scrapling-0.2.5.tar.gz → scrapling-0.2.6.tar.gz

Files changed (49)

{scrapling-0.2.5/scrapling.egg-info → scrapling-0.2.6}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.2.5
+Version: 0.2.6
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
@@ -39,7 +39,7 @@ Requires-Dist: w3lib
 Requires-Dist: orjson>=3
 Requires-Dist: tldextract
 Requires-Dist: httpx[brotli,zstd]
-Requires-Dist: playwright
+Requires-Dist: playwright==1.48
 Requires-Dist: rebrowser-playwright
 Requires-Dist: camoufox>=0.3.10
 Requires-Dist: browserforge
@@ -336,9 +336,11 @@ Using this Fetcher class, you can make requests with:
      * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
      * Using custom flags on launch to hide Playwright even more and make it faster.
      * Generates real browser's headers of the same type and same user OS then append it to the request's headers.
-  3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+  3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
   4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
+> Hence using the `real_chrome` argument requires that you have chrome browser installed on your device
 Add that to a lot of controlling/hiding options as you will see in the arguments list below.
 <details><summary><strong>Expand this for the complete list of arguments</strong></summary>
@@ -360,6 +362,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
 |     hide_canvas     | Add random noise to canvas operations to prevent fingerprinting.                                                                                                                                                                                                                                                                                                                                                |    ✔️    |
 |    disable_webgl    | Disables WebGL and WebGL 2.0 support entirely.                                                                                                                                                                                                                                                                                                                                                                  |    ✔️    |
 |       stealth       | Enables stealth mode, always check the documentation to see what stealth mode does currently.                                                                                                                                                                                                                                                                                                                   |    ✔️    |
+|     real_chrome     | If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.                                                                                                                                                                                                                                                                            |    ✔️    |
 |       cdp_url       | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP.                                                                                                                                                                                                                                                                                           |    ✔️    |
 |   nstbrowser_mode   | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.**                                                                                                                                                                                                                                                                                                      |    ✔️    |
 |  nstbrowser_config  | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._                                                                                                                                                                                                                                                        |    ✔️    |

{scrapling-0.2.5 → scrapling-0.2.6}/README.md RENAMED

@@ -290,9 +290,11 @@ Using this Fetcher class, you can make requests with:
      * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
      * Using custom flags on launch to hide Playwright even more and make it faster.
      * Generates real browser's headers of the same type and same user OS then append it to the request's headers.
-  3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+  3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
   4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
+> Hence using the `real_chrome` argument requires that you have chrome browser installed on your device
 Add that to a lot of controlling/hiding options as you will see in the arguments list below.
 <details><summary><strong>Expand this for the complete list of arguments</strong></summary>
@@ -314,6 +316,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
 |     hide_canvas     | Add random noise to canvas operations to prevent fingerprinting.                                                                                                                                                                                                                                                                                                                                                |    ✔️    |
 |    disable_webgl    | Disables WebGL and WebGL 2.0 support entirely.                                                                                                                                                                                                                                                                                                                                                                  |    ✔️    |
 |       stealth       | Enables stealth mode, always check the documentation to see what stealth mode does currently.                                                                                                                                                                                                                                                                                                                   |    ✔️    |
+|     real_chrome     | If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.                                                                                                                                                                                                                                                                            |    ✔️    |
 |       cdp_url       | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP.                                                                                                                                                                                                                                                                                           |    ✔️    |
 |   nstbrowser_mode   | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.**                                                                                                                                                                                                                                                                                                      |    ✔️    |
 |  nstbrowser_config  | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._                                                                                                                                                                                                                                                        |    ✔️    |

{scrapling-0.2.5 → scrapling-0.2.6}/scrapling/__init__.py RENAMED

@@ -4,7 +4,7 @@ from scrapling.parser import Adaptor, Adaptors
 from scrapling.core.custom_types import TextHandler, AttributesHandler
 __author__ = "Karim Shoair (karim.shoair@pm.me)"
-__version__ = "0.2.5"
+__version__ = "0.2.6"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"

{scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/pw.py RENAMED

@@ -27,11 +27,12 @@ class PlaywrightEngine:
             page_action: Callable = do_nothing,
             wait_selector: Optional[str] = None,
             wait_selector_state: Optional[str] = 'attached',
-            stealth: bool = False,
-            hide_canvas: bool = True,
-            disable_webgl: bool = False,
+            stealth: Optional[bool] = False,
+            real_chrome: Optional[bool] = False,
+            hide_canvas: Optional[bool] = False,
+            disable_webgl: Optional[bool] = False,
             cdp_url: Optional[str] = None,
-            nstbrowser_mode: bool = False,
+            nstbrowser_mode: Optional[bool] = False,
             nstbrowser_config: Optional[Dict] = None,
             google_search: Optional[bool] = True,
             extra_headers: Optional[Dict[str, str]] = None,
@@ -51,6 +52,7 @@ class PlaywrightEngine:
         :param wait_selector: Wait for a specific css selector to be in a specific state.
         :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
         :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
+        :param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
         :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
         :param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
         :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP.
@@ -67,6 +69,7 @@ class PlaywrightEngine:
         self.stealth = bool(stealth)
         self.hide_canvas = bool(hide_canvas)
         self.disable_webgl = bool(disable_webgl)
+        self.real_chrome = bool(real_chrome)
         self.google_search = bool(google_search)
         self.extra_headers = extra_headers or {}
         self.proxy = construct_proxy_dict(proxy)
@@ -119,7 +122,8 @@ class PlaywrightEngine:
         :param url: Target url.
         :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
         """
-        if not self.stealth:
+        if not self.stealth or self.real_chrome:
+            # Because rebrowser_playwright doesn't play well with real browsers
             from playwright.sync_api import sync_playwright
         else:
             from rebrowser_playwright.sync_api import sync_playwright
@@ -130,8 +134,8 @@ class PlaywrightEngine:
                 extra_headers = {}
                 useragent = self.useragent
             else:
-                extra_headers = generate_headers(browser_mode=True)
-                useragent = extra_headers.get('User-Agent')
+                extra_headers = {}
+                useragent = generate_headers(browser_mode=True).get('User-Agent')
             # Prepare the flags before diving
             flags = DEFAULT_STEALTH_FLAGS
@@ -146,9 +150,11 @@ class PlaywrightEngine:
                 browser = p.chromium.connect_over_cdp(endpoint_url=cdp_url)
             else:
                 if self.stealth:
-                    browser = p.chromium.launch(headless=self.headless, args=flags, ignore_default_args=['--enable-automation'], chromium_sandbox=True)
+                    browser = p.chromium.launch(
+                        headless=self.headless, args=flags, ignore_default_args=['--enable-automation'], chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
+                    )
                 else:
-                    browser = p.chromium.launch(headless=self.headless, ignore_default_args=['--enable-automation'])
+                    browser = p.chromium.launch(headless=self.headless, ignore_default_args=['--enable-automation'], channel='chrome' if self.real_chrome else 'chromium')
             # Creating the context
             if self.stealth:
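
The two branches changed in this hunk can be summarised as a small decision function. This is only an illustrative sketch, not code from the package: `select_backend` is a hypothetical helper, and it returns plain strings instead of importing Playwright or launching a browser.

```python
from typing import Tuple


def select_backend(stealth: bool, real_chrome: bool) -> Tuple[str, str]:
    """Mirror the 0.2.6 branching: return (module to import, launch channel).

    Hypothetical helper for illustration; the real engine imports
    `sync_playwright` and calls `p.chromium.launch(...)` directly.
    """
    # Stock playwright is used when stealth is off OR real_chrome is on,
    # because the diff notes rebrowser_playwright does not play well
    # with real browsers.
    if not stealth or real_chrome:
        module = "playwright.sync_api"
    else:
        module = "rebrowser_playwright.sync_api"

    # The channel argument picks the installed Chrome or the bundled Chromium.
    channel = "chrome" if real_chrome else "chromium"
    return module, channel


# Print the full truth table for the two flags.
for stealth in (False, True):
    for real_chrome in (False, True):
        print(stealth, real_chrome, select_backend(stealth, real_chrome))
```

Under this reading, `rebrowser_playwright` is only selected when `stealth` is enabled and `real_chrome` is not.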

{scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/fingerprints.py RENAMED

@@ -67,7 +67,7 @@ def generate_headers(browser_mode: bool = False) -> Dict:
         # So we don't raise any inconsistency red flags while websites fingerprinting us
         os_name = get_os_name()
         return HeaderGenerator(
-            browser=[Browser(name='chrome', min_version=128)],
+            browser=[Browser(name='chrome', min_version=130)],
             os=os_name,  # None is ignored
             device='desktop'
         ).generate()

{scrapling-0.2.5 → scrapling-0.2.6}/scrapling/fetchers.py RENAMED

@@ -138,7 +138,7 @@ class PlayWrightFetcher(BaseFetcher):
                 2) Mimics some of the real browsers' properties by injecting several JS files and using custom options.
                 3) Using custom flags on launch to hide Playwright even more and make it faster.
                 4) Generates real browser's headers of the same type and same user OS then append it to the request.
-        - Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+        - Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
         - NSTBrowser's docker browserless option by passing the CDP URL and enabling `nstbrowser_mode` option.
     > Note that these are the main options with PlayWright but it can be mixed together.
@@ -146,12 +146,12 @@ class PlayWrightFetcher(BaseFetcher):
     def fetch(
             self, url: str, headless: Union[bool, str] = True, disable_resources: bool = None,
             useragent: Optional[str] = None, network_idle: Optional[bool] = False, timeout: Optional[float] = 30000,
-            page_action: Callable = do_nothing, wait_selector: Optional[str] = None, wait_selector_state: Optional[str] = 'attached',
-            hide_canvas: bool = True, disable_webgl: bool = False, extra_headers: Optional[Dict[str, str]] = None, google_search: Optional[bool] = True,
+            page_action: Optional[Callable] = do_nothing, wait_selector: Optional[str] = None, wait_selector_state: Optional[str] = 'attached',
+            hide_canvas: Optional[bool] = False, disable_webgl: Optional[bool] = False, extra_headers: Optional[Dict[str, str]] = None, google_search: Optional[bool] = True,
             proxy: Optional[Union[str, Dict[str, str]]] = None,
-            stealth: bool = False,
+            stealth: Optional[bool] = False, real_chrome: Optional[bool] = False,
             cdp_url: Optional[str] = None,
-            nstbrowser_mode: bool = False, nstbrowser_config: Optional[Dict] = None,
+            nstbrowser_mode: Optional[bool] = False, nstbrowser_config: Optional[Dict] = None,
     ) -> Response:
         """Opens up a browser and do your request based on your chosen options below.
@@ -167,6 +167,7 @@ class PlayWrightFetcher(BaseFetcher):
         :param wait_selector: Wait for a specific css selector to be in a specific state.
         :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
         :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
+        :param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
         :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
         :param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
         :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name.
@@ -184,6 +185,7 @@ class PlayWrightFetcher(BaseFetcher):
             cdp_url=cdp_url,
             headless=headless,
             useragent=useragent,
+            real_chrome=real_chrome,
             page_action=page_action,
             hide_canvas=hide_canvas,
             network_idle=network_idle,
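
A minimal usage sketch of the new argument, based only on the `fetch` signature shown in this diff. It assumes scrapling 0.2.6 or later and a local Chrome installation; `fetch_with_real_chrome` is a hypothetical wrapper name, and the import is deferred so that defining the helper does not itself require scrapling.

```python
def fetch_with_real_chrome(url: str):
    """Fetch `url` through the locally installed Chrome browser.

    Hypothetical wrapper for illustration; requires scrapling>=0.2.6
    and Chrome to be installed on the machine when it is called.
    """
    # Imported lazily so this module can be loaded without scrapling.
    from scrapling.fetchers import PlayWrightFetcher

    # real_chrome=True asks the engine to launch the installed Chrome
    # (channel 'chrome') instead of the default Chromium channel.
    return PlayWrightFetcher().fetch(url, real_chrome=True)


# Example call (not executed here, since it opens a real browser):
#   page = fetch_with_real_chrome("https://example.com")
#   print(page.status)
```

Per the docstring in `pw.py`, the returned `Response` exposes `status`, `reason`, `cookies`, `headers`, and `request_headers` in addition to the usual `Adaptor` interface.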

{scrapling-0.2.5 → scrapling-0.2.6/scrapling.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.2.5
+Version: 0.2.6
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
@@ -39,7 +39,7 @@ Requires-Dist: w3lib
 Requires-Dist: orjson>=3
 Requires-Dist: tldextract
 Requires-Dist: httpx[brotli,zstd]
-Requires-Dist: playwright
+Requires-Dist: playwright==1.48
 Requires-Dist: rebrowser-playwright
 Requires-Dist: camoufox>=0.3.10
 Requires-Dist: browserforge
@@ -336,9 +336,11 @@ Using this Fetcher class, you can make requests with:
      * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
      * Using custom flags on launch to hide Playwright even more and make it faster.
      * Generates real browser's headers of the same type and same user OS then append it to the request's headers.
-  3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+  3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
   4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
+> Hence using the `real_chrome` argument requires that you have chrome browser installed on your device
 Add that to a lot of controlling/hiding options as you will see in the arguments list below.
 <details><summary><strong>Expand this for the complete list of arguments</strong></summary>
@@ -360,6 +362,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
 |     hide_canvas     | Add random noise to canvas operations to prevent fingerprinting.                                                                                                                                                                                                                                                                                                                                                |    ✔️    |
 |    disable_webgl    | Disables WebGL and WebGL 2.0 support entirely.                                                                                                                                                                                                                                                                                                                                                                  |    ✔️    |
 |       stealth       | Enables stealth mode, always check the documentation to see what stealth mode does currently.                                                                                                                                                                                                                                                                                                                   |    ✔️    |
+|     real_chrome     | If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.                                                                                                                                                                                                                                                                            |    ✔️    |
 |       cdp_url       | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP.                                                                                                                                                                                                                                                                                           |    ✔️    |
 |   nstbrowser_mode   | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.**                                                                                                                                                                                                                                                                                                      |    ✔️    |
 |  nstbrowser_config  | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._                                                                                                                                                                                                                                                        |    ✔️    |

{scrapling-0.2.5 → scrapling-0.2.6}/scrapling.egg-info/requires.txt RENAMED

@@ -5,7 +5,7 @@ w3lib
 orjson>=3
 tldextract
 httpx[brotli,zstd]
-playwright
+playwright==1.48
 rebrowser-playwright
 camoufox>=0.3.10
 browserforge

{scrapling-0.2.5 → scrapling-0.2.6}/setup.cfg RENAMED

@@ -1,6 +1,6 @@
 [metadata]
 name = scrapling
-version = 0.2.5
+version = 0.2.6
 author = Karim Shoair
 author_email = karim.shoair@pm.me
 description = Scrapling is an undetectable, powerful, flexible, adaptive, and high-performance web scraping library for Python.

{scrapling-0.2.5 → scrapling-0.2.6}/setup.py RENAMED

@@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh:
 setup(
     name="scrapling",
-    version="0.2.5",
+    version="0.2.6",
     description="""Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
     simplifies the process of extracting data from websites, even when they undergo structural changes, and offers
     impressive speed improvements over many popular scraping tools.""",
@@ -55,7 +55,7 @@ setup(
         "orjson>=3",
         "tldextract",
         'httpx[brotli,zstd]',
-        'playwright',
+        'playwright==1.48',  # Temporary because currently All libraries that provide CDP patches doesn't support playwright 1.49 yet
         'rebrowser-playwright',
         'camoufox>=0.3.10',
         'browserforge',
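
To see what the new `playwright==1.48` pin accepts, the sketch below checks a version string against an exact pin. It is a deliberate simplification, not the packaging logic pip uses: it handles only plain numeric release versions (no pre-releases, local labels, or wildcards), and `matches_exact_pin` and `installed_version` are hypothetical helper names.

```python
from importlib import metadata
from typing import Optional, Tuple


def _release(version: str) -> Tuple[int, ...]:
    """Parse 'X.Y.Z' into a tuple, dropping trailing zeros so 1.48 == 1.48.0."""
    parts = [int(p) for p in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def matches_exact_pin(installed: str, pinned: str) -> bool:
    """True if `installed` satisfies an `==pinned` requirement (numeric versions only)."""
    return _release(installed) == _release(pinned)


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of `dist`, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


print(matches_exact_pin("1.48.0", "1.48"))  # True
print(matches_exact_pin("1.49.0", "1.48"))  # False
print(installed_version("playwright"))      # depends on the environment
```

Under these assumptions the pin accepts a `1.48.0` release but rejects `1.48.1` and `1.49.0`, in line with the setup.py comment that the pin is temporary until the CDP-patching libraries support playwright 1.49.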