scrapling 0.3.3__tar.gz → 0.3.5__tar.gz
This diff compares the contents of two publicly released versions of the package as published to a supported registry. It is provided for informational purposes only and reflects the changes between those versions as they appear in their public registry.
- {scrapling-0.3.3/scrapling.egg-info → scrapling-0.3.5}/PKG-INFO +18 -17
- {scrapling-0.3.3 → scrapling-0.3.5}/README.md +13 -12
- {scrapling-0.3.3 → scrapling-0.3.5}/pyproject.toml +4 -4
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/__init__.py +1 -1
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/cli.py +4 -4
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/custom_types.py +2 -2
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/shell.py +21 -6
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/_browsers/_base.py +5 -31
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/_browsers/_camoufox.py +74 -44
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/_browsers/_controllers.py +41 -50
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/_browsers/_page.py +1 -42
- scrapling-0.3.5/scrapling/engines/_browsers/_validators.py +229 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/static.py +2 -4
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/navigation.py +1 -1
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/parser.py +16 -12
- {scrapling-0.3.3 → scrapling-0.3.5/scrapling.egg-info}/PKG-INFO +18 -17
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling.egg-info/requires.txt +4 -4
- {scrapling-0.3.3 → scrapling-0.3.5}/setup.cfg +1 -1
- scrapling-0.3.3/scrapling/engines/_browsers/_validators.py +0 -164
- {scrapling-0.3.3 → scrapling-0.3.5}/LICENSE +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/MANIFEST.in +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/__init__.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/_html_utils.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/_types.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/ai.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/mixins.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/storage.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/translator.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/utils/__init__.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/utils/_shell.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/core/utils/_utils.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/__init__.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/_browsers/__init__.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/_browsers/_config_tools.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/constants.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/__init__.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/convertor.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/custom.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/engines/toolbelt/fingerprints.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/fetchers.py +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling/py.typed +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling.egg-info/SOURCES.txt +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling.egg-info/entry_points.txt +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.3.3 → scrapling-0.3.5}/scrapling.egg-info/top_level.txt +0 -0
PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scrapling
-Version: 0.3.3
+Version: 0.3.5
 Summary: Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
@@ -69,15 +69,15 @@ Requires-Dist: cssselect>=1.3.0
 Requires-Dist: orjson>=3.11.3
 Requires-Dist: tldextract>=5.3.0
 Provides-Extra: fetchers
-Requires-Dist: click>=8.…
+Requires-Dist: click>=8.3.0; extra == "fetchers"
 Requires-Dist: curl_cffi>=0.13.0; extra == "fetchers"
-Requires-Dist: playwright>=1.…
-Requires-Dist: …
+Requires-Dist: playwright>=1.55.0; extra == "fetchers"
+Requires-Dist: patchright>=1.55.2; extra == "fetchers"
 Requires-Dist: camoufox>=0.4.11; extra == "fetchers"
 Requires-Dist: geoip2>=5.1.0; extra == "fetchers"
 Requires-Dist: msgspec>=0.19.0; extra == "fetchers"
 Provides-Extra: ai
-Requires-Dist: mcp>=1.14.…
+Requires-Dist: mcp>=1.14.1; extra == "ai"
 Requires-Dist: markdownify>=1.2.0; extra == "ai"
 Requires-Dist: scrapling[fetchers]; extra == "ai"
 Provides-Extra: shell
@@ -114,14 +114,6 @@ Dynamic: license-file
 </p>

 <p align="center">
-<a href="https://scrapling.readthedocs.io/en/latest/#installation">
-Installation
-</a>
-·
-<a href="https://scrapling.readthedocs.io/en/latest/overview/">
-Overview
-</a>
-·
 <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/">
 Selection methods
 </a>
@@ -130,6 +122,14 @@ Dynamic: license-file
 Choosing a fetcher
 </a>
 ·
+<a href="https://scrapling.readthedocs.io/en/latest/cli/overview/">
+CLI
+</a>
+·
+<a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/">
+MCP mode
+</a>
+·
 <a href="https://scrapling.readthedocs.io/en/latest/tutorials/migrating_from_beautifulsoup/">
 Migrating from Beautifulsoup
 </a>
@@ -157,11 +157,13 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet

 <!-- sponsors -->

+<a href="https://www.thordata.com/?ls=github&lk=D4Vinci" target="_blank" title="A global network of over 60M+ residential proxies with 99.7% availability, ensuring stable and reliable web data scraping to support AI, BI, and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
 <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
+<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
 <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
-<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
 <a href="https://www.nstproxy.com/?type=flow&utm_source=scrapling" target="_blank" title="One Proxy Service, Infinite Solutions at Unbeatable Prices!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/NSTproxy.png"></a>
+<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>

 <!-- /sponsors -->

@@ -410,10 +412,9 @@ This project includes code adapted from:
 ## Thanks and References

 - [Daijro](https://github.com/daijro)'s brilliant work on [BrowserForge](https://github.com/daijro/browserforge) and [Camoufox](https://github.com/daijro/camoufox)
-- [Vinyzu](https://github.com/Vinyzu)'s work on [Botright](https://github.com/Vinyzu/Botright)
+- [Vinyzu](https://github.com/Vinyzu)'s brilliant work on [Botright](https://github.com/Vinyzu/Botright) and [PatchRight](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright)
 - [brotector](https://github.com/kaliiiiiiiiii/brotector) for browser detection bypass techniques
-- [fakebrowser](https://github.com/kkoooqq/fakebrowser) for fingerprinting research
-- [rebrowser-patches](https://github.com/rebrowser/rebrowser-patches) for stealth improvements
+- [fakebrowser](https://github.com/kkoooqq/fakebrowser) and [BotBrowser](https://github.com/botswin/BotBrowser) for fingerprinting research

 ---
 <div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>
README.md
@@ -24,14 +24,6 @@
 </p>

 <p align="center">
-<a href="https://scrapling.readthedocs.io/en/latest/#installation">
-Installation
-</a>
-·
-<a href="https://scrapling.readthedocs.io/en/latest/overview/">
-Overview
-</a>
-·
 <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/">
 Selection methods
 </a>
@@ -40,6 +32,14 @@
 Choosing a fetcher
 </a>
 ·
+<a href="https://scrapling.readthedocs.io/en/latest/cli/overview/">
+CLI
+</a>
+·
+<a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/">
+MCP mode
+</a>
+·
 <a href="https://scrapling.readthedocs.io/en/latest/tutorials/migrating_from_beautifulsoup/">
 Migrating from Beautifulsoup
 </a>
@@ -67,11 +67,13 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet

 <!-- sponsors -->

+<a href="https://www.thordata.com/?ls=github&lk=D4Vinci" target="_blank" title="A global network of over 60M+ residential proxies with 99.7% availability, ensuring stable and reliable web data scraping to support AI, BI, and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
 <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
+<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
 <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
-<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
 <a href="https://www.nstproxy.com/?type=flow&utm_source=scrapling" target="_blank" title="One Proxy Service, Infinite Solutions at Unbeatable Prices!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/NSTproxy.png"></a>
+<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>

 <!-- /sponsors -->

@@ -320,10 +322,9 @@ This project includes code adapted from:
 ## Thanks and References

 - [Daijro](https://github.com/daijro)'s brilliant work on [BrowserForge](https://github.com/daijro/browserforge) and [Camoufox](https://github.com/daijro/camoufox)
-- [Vinyzu](https://github.com/Vinyzu)'s work on [Botright](https://github.com/Vinyzu/Botright)
+- [Vinyzu](https://github.com/Vinyzu)'s brilliant work on [Botright](https://github.com/Vinyzu/Botright) and [PatchRight](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright)
 - [brotector](https://github.com/kaliiiiiiiiii/brotector) for browser detection bypass techniques
-- [fakebrowser](https://github.com/kkoooqq/fakebrowser) for fingerprinting research
-- [rebrowser-patches](https://github.com/rebrowser/rebrowser-patches) for stealth improvements
+- [fakebrowser](https://github.com/kkoooqq/fakebrowser) and [BotBrowser](https://github.com/botswin/BotBrowser) for fingerprinting research

 ---
 <div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>
pyproject.toml
@@ -64,16 +64,16 @@ dependencies = [

 [project.optional-dependencies]
 fetchers = [
-    "click>=8.…
+    "click>=8.3.0",
     "curl_cffi>=0.13.0",
-    "playwright>=1.…
-    "…
+    "playwright>=1.55.0",
+    "patchright>=1.55.2",
     "camoufox>=0.4.11",
     "geoip2>=5.1.0",
     "msgspec>=0.19.0",
 ]
 ai = [
-    "mcp>=1.14.…
+    "mcp>=1.14.1",
     "markdownify>=1.2.0",
     "scrapling[fetchers]",
 ]
scrapling/cli.py
@@ -32,8 +32,8 @@ def __ParseJSONData(json_string: Optional[str] = None) -> Optional[Dict[str, Any

     try:
         return json_loads(json_string)
-    except JSONDecodeError as …
-        raise ValueError(f"Invalid JSON data '{json_string}': {…
+    except JSONDecodeError as err:  # pragma: no cover
+        raise ValueError(f"Invalid JSON data '{json_string}': {err}")


 def __Request_and_Save(
@@ -65,8 +65,8 @@ def __ParseExtractArguments(
     for key, value in _CookieParser(cookies):
         try:
             parsed_cookies[key] = value
-        except Exception as …
-            raise ValueError(f"Could not parse cookies '{cookies}': {…
+        except Exception as err:
+            raise ValueError(f"Could not parse cookies '{cookies}': {err}")

     parsed_json = __ParseJSONData(json)
     parsed_params = {}
scrapling/core/custom_types.py
@@ -145,7 +145,7 @@ class TextHandler(str):
         clean_match: bool = False,
         case_sensitive: bool = True,
         check_match: Literal[False] = False,
-    ) -> "TextHandlers…
+    ) -> "TextHandlers": ...

     def re(
         self,
@@ -241,7 +241,7 @@ class TextHandlers(List[TextHandler]):
         replace_entities: bool = True,
         clean_match: bool = False,
         case_sensitive: bool = True,
-    ) -> "TextHandlers…
+    ) -> "TextHandlers":
         """Call the ``.re()`` method for each element in this list and return
         their results flattened as TextHandlers.

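The corrected annotations above state that `.re()` returns a `TextHandlers` collection. A minimal sketch of what that means in practice, assuming the first positional argument of `.re()` is the regex pattern (that parameter is not visible in this hunk, so treat it as an assumption):

```python
from scrapling.core.custom_types import TextHandler

# Illustrative only: the pattern argument and the printed output are assumptions,
# not taken from this diff.
text = TextHandler("Price: 10 USD, was 15 USD")
matches = text.re(r"\d+")       # returns a TextHandlers list per the fixed annotation
print(type(matches).__name__)   # TextHandlers
print(list(matches))            # e.g. ['10', '15']
```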
scrapling/core/shell.py
@@ -201,7 +201,7 @@ class CurlParser:
            data_payload = parsed_args.data_binary  # Fallback to string

        elif parsed_args.data_raw is not None:
-           data_payload = parsed_args.data_raw
+           data_payload = parsed_args.data_raw.lstrip("$")

        elif parsed_args.data is not None:
            data_payload = parsed_args.data
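The `.lstrip("$")` above handles curl commands copied from browser devtools, which often wrap the payload in ANSI-C `$'...'` quoting. A minimal sketch of the problem, assuming a POSIX-style tokenizer such as `shlex` (the actual `CurlParser` internals are not shown in this diff):

```python
import shlex

# ANSI-C quoting keeps its leading "$" after POSIX tokenization, so the raw
# payload would otherwise start with a stray "$".
cmd = """curl 'https://httpbin.org/post' --data-raw $'{"query": "scrapling"}'"""
tokens = shlex.split(cmd)
raw = tokens[tokens.index("--data-raw") + 1]
print(raw)              # ${"query": "scrapling"}
print(raw.lstrip("$"))  # {"query": "scrapling"}
```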
@@ -318,7 +318,7 @@ def show_page_in_browser(page: Selector):  # pragma: no cover
    try:
        fd, fname = make_temp_file(prefix="scrapling_view_", suffix=".html")
        with open(fd, "w", encoding=page.encoding) as f:
-           f.write(page.…
+           f.write(page.html_content)

        open_in_browser(f"file://{fname}")
    except IOError as e:
@@ -335,15 +335,25 @@ class CustomShell:
        from scrapling.fetchers import (
            Fetcher as __Fetcher,
            AsyncFetcher as __AsyncFetcher,
+           FetcherSession as __FetcherSession,
            DynamicFetcher as __DynamicFetcher,
+           DynamicSession as __DynamicSession,
+           AsyncDynamicSession as __AsyncDynamicSession,
            StealthyFetcher as __StealthyFetcher,
+           StealthySession as __StealthySession,
+           AsyncStealthySession as __AsyncStealthySession,
        )

        self.__InteractiveShellEmbed = __InteractiveShellEmbed
        self.__Fetcher = __Fetcher
        self.__AsyncFetcher = __AsyncFetcher
+       self.__FetcherSession = __FetcherSession
        self.__DynamicFetcher = __DynamicFetcher
+       self.__DynamicSession = __DynamicSession
+       self.__AsyncDynamicSession = __AsyncDynamicSession
        self.__StealthyFetcher = __StealthyFetcher
+       self.__StealthySession = __StealthySession
+       self.__AsyncStealthySession = __AsyncStealthySession
        self.code = code
        self.page = None
        self.pages = Selectors([])
@@ -379,9 +389,9 @@ class CustomShell:
        """Create a custom banner for the shell"""
        return f"""
-> Available Scrapling objects:
-   - Fetcher/AsyncFetcher
-   - DynamicFetcher
-   - StealthyFetcher
+   - Fetcher/AsyncFetcher/FetcherSession
+   - DynamicFetcher/DynamicSession/AsyncDynamicSession
+   - StealthyFetcher/StealthySession/AsyncStealthySession
    - Selector

-> Useful shortcuts:
@@ -449,6 +459,11 @@ Type 'exit' or press Ctrl+D to exit.
            "delete": delete,
            "Fetcher": self.__Fetcher,
            "AsyncFetcher": self.__AsyncFetcher,
+           "FetcherSession": self.__FetcherSession,
+           "DynamicSession": self.__DynamicSession,
+           "AsyncDynamicSession": self.__AsyncDynamicSession,
+           "StealthySession": self.__StealthySession,
+           "AsyncStealthySession": self.__AsyncStealthySession,
            "fetch": dynamic_fetch,
            "DynamicFetcher": self.__DynamicFetcher,
            "stealthy_fetch": stealthy_fetch,
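With the session classes now exposed in the shell namespace above, they can be used directly alongside the one-off fetchers in `scrapling shell`. A rough sketch of how that might look; the constructor keywords, context-manager support, and the `fetch()`/selection calls here are assumptions, not confirmed by this diff:

```python
# Hypothetical usage inside the interactive shell (names come from the
# namespace dict above; headless/max_pages kwargs and the Response API are
# assumed, not verified against the library).
with StealthySession(headless=True, max_pages=2) as session:
    page = session.fetch("https://example.com")
    print(page.status)              # HTTP status of the final response
    print(page.css("title::text"))  # selector API assumed from Scrapling's parser
```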
@@ -530,7 +545,7 @@ class Convertor:
        for page in pages:
            match extraction_type:
                case "markdown":
-                   yield cls._convert_to_markdown(page.…
+                   yield cls._convert_to_markdown(page.html_content)
                case "html":
                    yield page.body
                case "text":
scrapling/engines/_browsers/_base.py
@@ -1,4 +1,4 @@
-from time import time…
+from time import time
 from asyncio import sleep as asyncio_sleep, Lock

 from camoufox import DefaultAddons
@@ -31,7 +31,7 @@ class SyncSession:
    def __init__(self, max_pages: int = 1):
        self.max_pages = max_pages
        self.page_pool = PagePool(max_pages)
-       self.…
+       self._max_wait_for_page = 60
        self.playwright: Optional[Playwright] = None
        self.context: Optional[BrowserContext] = None
        self._closed = False
@@ -44,23 +44,7 @@ class SyncSession:
    ) -> PageInfo:  # pragma: no cover
        """Get a new page to use"""

-       # …
-       self.page_pool.close_all_finished_pages()
-
-       # If we're at max capacity after cleanup, wait for busy pages to finish
-       if self.page_pool.pages_count >= self.max_pages:
-           start_time = time()
-           while time() - start_time < self.__max_wait_for_page:
-               # Wait for any pages to finish, then clean them up
-               sleep(0.05)
-               self.page_pool.close_all_finished_pages()
-               if self.page_pool.pages_count < self.max_pages:
-                   break
-           else:
-               raise TimeoutError(
-                   f"No pages finished to clear place in the pool within the {self.__max_wait_for_page}s timeout period"
-               )
-
+       # No need to check if a page is available or not in sync code because the code blocked before reaching here till the page closed, ofc.
        page = self.context.new_page()
        page.set_default_navigation_timeout(timeout)
        page.set_default_timeout(timeout)
@@ -76,11 +60,6 @@ class SyncSession:

        return self.page_pool.add_page(page)

-   @staticmethod
-   def _get_with_precedence(request_value: Any, session_value: Any, sentinel_value: object) -> Any:
-       """Get value with request-level priority over session-level"""
-       return request_value if request_value is not sentinel_value else session_value
-
    def get_pool_stats(self) -> Dict[str, int]:
        """Get statistics about the current page pool"""
        return {
@@ -105,21 +84,16 @@ class AsyncSession(SyncSession):
    ) -> PageInfo:  # pragma: no cover
        """Get a new page to use"""
        async with self._lock:
-           # Close all finished pages to ensure clean state
-           await self.page_pool.aclose_all_finished_pages()
-
            # If we're at max capacity after cleanup, wait for busy pages to finish
            if self.page_pool.pages_count >= self.max_pages:
                start_time = time()
-               while time() - start_time < self.…
-                   # Wait for any pages to finish, then clean them up
+               while time() - start_time < self._max_wait_for_page:
                    await asyncio_sleep(0.05)
-                   await self.page_pool.aclose_all_finished_pages()
                    if self.page_pool.pages_count < self.max_pages:
                        break
                else:
                    raise TimeoutError(
-                       f"No pages finished to clear place in the pool within the {self.…
+                       f"No pages finished to clear place in the pool within the {self._max_wait_for_page}s timeout period"
                    )

            page = await self.context.new_page()
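The page pool above is what bounds how many tabs a session keeps open: when `pages_count` reaches `max_pages`, the async path polls every 50 ms for up to `_max_wait_for_page` seconds before raising `TimeoutError`. A rough sketch of the behaviour this enables, assuming `AsyncStealthySession` accepts `max_pages`, works as an async context manager, and returns responses with a `.status` attribute (none of which is shown in this hunk):

```python
import asyncio
from scrapling.fetchers import AsyncStealthySession

async def main():
    # Assumed constructor/usage: with max_pages=2, the third fetch waits for a
    # free slot in the page pool instead of opening a third tab.
    async with AsyncStealthySession(max_pages=2) as session:
        results = await asyncio.gather(
            session.fetch("https://example.com/a"),
            session.fetch("https://example.com/b"),
            session.fetch("https://example.com/c"),
        )
        print([r.status for r in results])

asyncio.run(main())
```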
scrapling/engines/_browsers/_camoufox.py
@@ -14,8 +14,9 @@ from playwright.async_api import (
    Locator as AsyncLocator,
    Page as async_Page,
 )
+from playwright._impl._errors import Error as PlaywrightError

-from ._validators import …
+from ._validators import validate_fetch as _validate
 from ._base import SyncSession, AsyncSession, StealthySessionMixin
 from scrapling.core.utils import log
 from scrapling.core._types import (
@@ -201,20 +202,34 @@ class StealthySession(StealthySessionMixin, SyncSession):

        self._closed = True

+   @staticmethod
+   def _get_page_content(page: Page) -> str | None:
+       """
+       A workaround for Playwright issue with `page.content()` on Windows. Ref.: https://github.com/microsoft/playwright/issues/16108
+       :param page: The page to extract content from.
+       :return:
+       """
+       while True:
+           try:
+               return page.content() or ""
+           except PlaywrightError:
+               page.wait_for_timeout(1000)
+               continue
+
    def _solve_cloudflare(self, page: Page) -> None:  # pragma: no cover
        """Solve the cloudflare challenge displayed on the playwright page passed

        :param page: The targeted page
        :return:
        """
-       challenge_type = self._detect_cloudflare(…
+       challenge_type = self._detect_cloudflare(self._get_page_content(page))
        if not challenge_type:
            log.error("No Cloudflare challenge found.")
            return
        else:
            log.info(f'The turnstile version discovered is "{challenge_type}"')
            if challenge_type == "non-interactive":
-               while "<title>Just a moment...</title>" in (…
+               while "<title>Just a moment...</title>" in (self._get_page_content(page)):
                    log.info("Waiting for Cloudflare wait page to disappear.")
                    page.wait_for_timeout(1000)
                    page.wait_for_load_state()
@@ -222,7 +237,7 @@ class StealthySession(StealthySessionMixin, SyncSession):
                return

            else:
-               while "Verifying you are human." in …
+               while "Verifying you are human." in self._get_page_content(page):
                    # Waiting for the verify spinner to disappear, checking every 1s if it disappeared
                    page.wait_for_timeout(500)

@@ -282,23 +297,22 @@ class StealthySession(StealthySessionMixin, SyncSession):
        :param selector_config: The arguments that will be passed in the end while creating the final Selector's class.
        :return: A `Response` object.
        """
-       … (16 removed lines not preserved in this rendering)
-           CamoufoxConfig,
+       params = _validate(
+           [
+               ("google_search", google_search, self.google_search),
+               ("timeout", timeout, self.timeout),
+               ("wait", wait, self.wait),
+               ("page_action", page_action, self.page_action),
+               ("extra_headers", extra_headers, self.extra_headers),
+               ("disable_resources", disable_resources, self.disable_resources),
+               ("wait_selector", wait_selector, self.wait_selector),
+               ("wait_selector_state", wait_selector_state, self.wait_selector_state),
+               ("network_idle", network_idle, self.network_idle),
+               ("load_dom", load_dom, self.load_dom),
+               ("solve_cloudflare", solve_cloudflare, self.solve_cloudflare),
+               ("selector_config", selector_config, self.selector_config),
+           ],
+           _UNSET,
        )

        if self._closed:  # pragma: no cover
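Each triple passed to `_validate` pairs a request-level argument with its session-level default, with `_UNSET` acting as the "not passed" sentinel; this replaces the `_get_with_precedence` helper removed from `_base.py`. A minimal sketch of the precedence rule only; the real `validate_fetch` in the new `_validators.py` also performs validation that is not reproduced here:

```python
# Hypothetical, simplified resolution: the request value wins unless it is the sentinel.
_UNSET = object()

def resolve(params, sentinel=_UNSET):
    return {
        name: session_value if request_value is sentinel else request_value
        for name, request_value, session_value in params
    }

print(resolve([
    ("timeout", _UNSET, 30000),  # not passed per request -> session default wins
    ("wait", 500, 0),            # passed per request -> overrides the session value
]))
# {'timeout': 30000, 'wait': 500}
```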
@@ -366,8 +380,9 @@ class StealthySession(StealthySessionMixin, SyncSession):
            page_info.page, first_response, final_response, params.selector_config
        )

-       # …
-       page_info.…
+       # Close the page, to free up resources
+       page_info.page.close()
+       self.page_pool.pages.remove(page_info)

        return response

@@ -506,20 +521,34 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):

        self._closed = True

+   @staticmethod
+   async def _get_page_content(page: async_Page) -> str | None:
+       """
+       A workaround for Playwright issue with `page.content()` on Windows. Ref.: https://github.com/microsoft/playwright/issues/16108
+       :param page: The page to extract content from.
+       :return:
+       """
+       while True:
+           try:
+               return (await page.content()) or ""
+           except PlaywrightError:
+               await page.wait_for_timeout(1000)
+               continue
+
    async def _solve_cloudflare(self, page: async_Page):
        """Solve the cloudflare challenge displayed on the playwright page passed. The async version

        :param page: The async targeted page
        :return:
        """
-       challenge_type = self._detect_cloudflare(await …
+       challenge_type = self._detect_cloudflare(await self._get_page_content(page))
        if not challenge_type:
            log.error("No Cloudflare challenge found.")
            return
        else:
            log.info(f'The turnstile version discovered is "{challenge_type}"')
            if challenge_type == "non-interactive":  # pragma: no cover
-               while "<title>Just a moment...</title>" in (await …
+               while "<title>Just a moment...</title>" in (await self._get_page_content(page)):
                    log.info("Waiting for Cloudflare wait page to disappear.")
                    await page.wait_for_timeout(1000)
                    await page.wait_for_load_state()
@@ -527,7 +556,7 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
                return

            else:
-               while "Verifying you are human." in (await …
+               while "Verifying you are human." in (await self._get_page_content(page)):
                    # Waiting for the verify spinner to disappear, checking every 1s if it disappeared
                    await page.wait_for_timeout(500)

@@ -587,22 +616,22 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
        :param selector_config: The arguments that will be passed in the end while creating the final Selector's class.
        :return: A `Response` object.
        """
-       params = …
-       … (15 removed lines not preserved in this rendering)
+       params = _validate(
+           [
+               ("google_search", google_search, self.google_search),
+               ("timeout", timeout, self.timeout),
+               ("wait", wait, self.wait),
+               ("page_action", page_action, self.page_action),
+               ("extra_headers", extra_headers, self.extra_headers),
+               ("disable_resources", disable_resources, self.disable_resources),
+               ("wait_selector", wait_selector, self.wait_selector),
+               ("wait_selector_state", wait_selector_state, self.wait_selector_state),
+               ("network_idle", network_idle, self.network_idle),
+               ("load_dom", load_dom, self.load_dom),
+               ("solve_cloudflare", solve_cloudflare, self.solve_cloudflare),
+               ("selector_config", selector_config, self.selector_config),
+           ],
+           _UNSET,
        )

        if self._closed:  # pragma: no cover
@@ -672,8 +701,9 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
            page_info.page, first_response, final_response, params.selector_config
        )

-       # …
-       page_info.…
+       # Close the page, to free up resources
+       await page_info.page.close()
+       self.page_pool.pages.remove(page_info)

        return response
