scrapling 0.3.7__tar.gz → 0.3.8__tar.gz
This diff shows the changes between publicly released package versions as they appear in their respective public registries; it is provided for informational purposes only.
- {scrapling-0.3.7/scrapling.egg-info → scrapling-0.3.8}/PKG-INFO +6 -4
- {scrapling-0.3.7 → scrapling-0.3.8}/README.md +2 -1
- {scrapling-0.3.7 → scrapling-0.3.8}/pyproject.toml +5 -4
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/__init__.py +1 -1
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/_browsers/_base.py +140 -9
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/_browsers/_camoufox.py +47 -164
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/_browsers/_config_tools.py +8 -2
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/_browsers/_controllers.py +25 -96
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/_browsers/_validators.py +72 -61
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/convertor.py +37 -2
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/custom.py +0 -12
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/fingerprints.py +6 -8
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/fetchers/chrome.py +6 -0
- {scrapling-0.3.7 → scrapling-0.3.8/scrapling.egg-info}/PKG-INFO +6 -4
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling.egg-info/requires.txt +2 -2
- {scrapling-0.3.7 → scrapling-0.3.8}/setup.cfg +1 -1
- {scrapling-0.3.7 → scrapling-0.3.8}/LICENSE +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/MANIFEST.in +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/cli.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/__init__.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/_html_utils.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/_types.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/ai.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/custom_types.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/mixins.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/shell.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/storage.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/translator.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/utils/__init__.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/utils/_shell.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/core/utils/_utils.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/__init__.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/_browsers/__init__.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/_browsers/_page.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/constants.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/static.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/__init__.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/engines/toolbelt/navigation.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/fetchers/__init__.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/fetchers/firefox.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/fetchers/requests.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/parser.py +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling/py.typed +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling.egg-info/SOURCES.txt +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling.egg-info/entry_points.txt +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.3.7 → scrapling-0.3.8}/scrapling.egg-info/top_level.txt +0 -0
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scrapling
-Version: 0.3.7
+Version: 0.3.8
 Summary: Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
@@ -36,6 +36,7 @@ License: BSD 3-Clause License
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 Project-URL: Homepage, https://github.com/D4Vinci/Scrapling
+Project-URL: Changelog, https://github.com/D4Vinci/Scrapling/releases
 Project-URL: Documentation, https://scrapling.readthedocs.io/en/latest/
 Project-URL: Repository, https://github.com/D4Vinci/Scrapling
 Project-URL: Bug Tracker, https://github.com/D4Vinci/Scrapling/issues
@@ -66,7 +67,7 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: lxml>=6.0.2
 Requires-Dist: cssselect>=1.3.0
-Requires-Dist: orjson>=3.11.
+Requires-Dist: orjson>=3.11.4
 Requires-Dist: tldextract>=5.3.0
 Provides-Extra: fetchers
 Requires-Dist: click>=8.3.0; extra == "fetchers"
@@ -77,7 +78,7 @@ Requires-Dist: camoufox>=0.4.11; extra == "fetchers"
 Requires-Dist: geoip2>=5.1.0; extra == "fetchers"
 Requires-Dist: msgspec>=0.19.0; extra == "fetchers"
 Provides-Extra: ai
-Requires-Dist: mcp>=1.
+Requires-Dist: mcp>=1.19.0; extra == "ai"
 Requires-Dist: markdownify>=1.2.0; extra == "ai"
 Requires-Dist: scrapling[fetchers]; extra == "ai"
 Provides-Extra: shell
@@ -157,10 +158,11 @@ Built for the modern Web, Scrapling features its own rapid parsing engine and fe
 
 <!-- sponsors -->
 
-<a href="https://www.
+<a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
 <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
 <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
+<a href="https://app.cyberyozh.com/?utm_source=github&utm_medium=scrapling" target="_blank" title="We have gathered the best solutions for multi‑accounting and automation in one place."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/cyberyozh.png"></a>
 <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
 <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
 <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
--- a/README.md
+++ b/README.md
@@ -67,10 +67,11 @@ Built for the modern Web, Scrapling features its own rapid parsing engine and fe
 
 <!-- sponsors -->
 
-<a href="https://www.
+<a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
 <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
 <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
+<a href="https://app.cyberyozh.com/?utm_source=github&utm_medium=scrapling" target="_blank" title="We have gathered the best solutions for multi‑accounting and automation in one place."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/cyberyozh.png"></a>
 <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
 <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
 <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "scrapling"
-# Static version instead of dynamic version so we can get better layer caching while building docker, check the docker file to understand
-version = "0.3.7"
+# Static version instead of a dynamic version so we can get better layer caching while building docker, check the docker file to understand
+version = "0.3.8"
 description = "Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!"
 readme = {file = "README.md", content-type = "text/markdown"}
 license = {file = "LICENSE"}
@@ -59,7 +59,7 @@ classifiers = [
 dependencies = [
     "lxml>=6.0.2",
     "cssselect>=1.3.0",
-    "orjson>=3.11.
+    "orjson>=3.11.4",
     "tldextract>=5.3.0",
 ]
 
@@ -74,7 +74,7 @@ fetchers = [
     "msgspec>=0.19.0",
 ]
 ai = [
-    "mcp>=1.
+    "mcp>=1.19.0",
     "markdownify>=1.2.0",
     "scrapling[fetchers]",
 ]
@@ -89,6 +89,7 @@ all = [
 
 [project.urls]
 Homepage = "https://github.com/D4Vinci/Scrapling"
+Changelog = "https://github.com/D4Vinci/Scrapling/releases"
 Documentation = "https://scrapling.readthedocs.io/en/latest/"
 Repository = "https://github.com/D4Vinci/Scrapling"
 "Bug Tracker" = "https://github.com/D4Vinci/Scrapling/issues"
--- a/scrapling/engines/_browsers/_base.py
+++ b/scrapling/engines/_browsers/_base.py
@@ -2,17 +2,27 @@ from time import time
 from asyncio import sleep as asyncio_sleep, Lock
 
 from camoufox import DefaultAddons
-from playwright.sync_api import
+from playwright.sync_api import (
+    Page,
+    Frame,
+    BrowserContext,
+    Playwright,
+    Response as SyncPlaywrightResponse,
+)
 from playwright.async_api import (
-
+    Page as AsyncPage,
+    Frame as AsyncFrame,
     Playwright as AsyncPlaywright,
+    Response as AsyncPlaywrightResponse,
+    BrowserContext as AsyncBrowserContext,
 )
+from playwright._impl._errors import Error as PlaywrightError
 from camoufox.pkgman import installed_verstr as camoufox_version
 from camoufox.utils import launch_options as generate_launch_options
 
 from ._page import PageInfo, PagePool
 from scrapling.parser import Selector
-from scrapling.core._types import Any, cast, Dict, Optional, TYPE_CHECKING
+from scrapling.core._types import Any, cast, Dict, List, Optional, Callable, TYPE_CHECKING
 from scrapling.engines.toolbelt.fingerprints import get_os_name
 from ._validators import validate, PlaywrightConfig, CamoufoxConfig
 from ._config_tools import _compiled_stealth_scripts, _launch_kwargs, _context_kwargs
@@ -26,10 +36,35 @@ class SyncSession:
         self.max_pages = max_pages
         self.page_pool = PagePool(max_pages)
         self._max_wait_for_page = 60
-        self.playwright:
-        self.context:
+        self.playwright: Playwright | Any = None
+        self.context: BrowserContext | Any = None
         self._closed = False
 
+    def __create__(self):
+        pass
+
+    def close(self):  # pragma: no cover
+        """Close all resources"""
+        if self._closed:
+            return
+
+        if self.context:
+            self.context.close()
+            self.context = None
+
+        if self.playwright:
+            self.playwright.stop()
+            self.playwright = None  # pyright: ignore
+
+        self._closed = True
+
+    def __enter__(self):
+        self.__create__()
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.close()
+
     def _get_page(
         self,
         timeout: int | float,
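The `__enter__`/`__exit__` pair added above turns the session into a context manager, so the Playwright instance and browser context are released even when the body raises. A minimal sketch of the pattern in isolation (`DemoSession` is a stand-in class for illustration, not Scrapling's actual internals):

```python
# Sketch of the context-manager protocol the diff adds to SyncSession.
class DemoSession:
    def __init__(self) -> None:
        self._closed = False

    def __create__(self) -> None:
        pass  # the real class starts Playwright and opens a browser context here

    def close(self) -> None:
        if self._closed:
            return  # idempotent: calling close() twice is harmless
        # ...stop the browser context, then Playwright itself...
        self._closed = True

    def __enter__(self) -> "DemoSession":
        self.__create__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self.close()  # runs even if the with-block raised


with DemoSession() as session:
    ...  # fetch pages; cleanup happens automatically on exit
```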
scrapling/engines/_browsers/_base.py, continued:

@@ -53,7 +88,9 @@ class SyncSession:
         for script in _compiled_stealth_scripts():
             page.add_init_script(script=script)
 
-
+        page_info = self.page_pool.add_page(page)
+        page_info.mark_busy()
+        return page_info
 
     def get_pool_stats(self) -> Dict[str, int]:
         """Get statistics about the current page pool"""
@@ -63,17 +100,76 @@ class SyncSession:
             "max_pages": self.max_pages,
         }
 
+    @staticmethod
+    def _wait_for_networkidle(page: Page | Frame, timeout: Optional[int] = None):
+        """Wait for the page to become idle (no network activity) even if there are never-ending requests."""
+        try:
+            page.wait_for_load_state("networkidle", timeout=timeout)
+        except PlaywrightError:
+            pass
+
+    def _wait_for_page_stability(self, page: Page | Frame, load_dom: bool, network_idle: bool):
+        page.wait_for_load_state(state="load")
+        if load_dom:
+            page.wait_for_load_state(state="domcontentloaded")
+        if network_idle:
+            self._wait_for_networkidle(page)
+
+    @staticmethod
+    def _create_response_handler(page_info: PageInfo, response_container: List) -> Callable:
+        """Create a response handler that captures the final navigation response.
+
+        :param page_info: The PageInfo object containing the page
+        :param response_container: A list to store the final response (mutable container)
+        :return: A callback function for page.on("response", ...)
+        """
+
+        def handle_response(finished_response: SyncPlaywrightResponse):
+            if (
+                finished_response.request.resource_type == "document"
+                and finished_response.request.is_navigation_request()
+                and finished_response.request.frame == page_info.page.main_frame
+            ):
+                response_container[0] = finished_response
+
+        return handle_response
+
 
 class AsyncSession:
     def __init__(self, max_pages: int = 1):
         self.max_pages = max_pages
         self.page_pool = PagePool(max_pages)
         self._max_wait_for_page = 60
-        self.playwright:
-        self.context:
+        self.playwright: AsyncPlaywright | Any = None
+        self.context: AsyncBrowserContext | Any = None
         self._closed = False
         self._lock = Lock()
 
+    async def __create__(self):
+        pass
+
+    async def close(self):
+        """Close all resources"""
+        if self._closed:  # pragma: no cover
+            return
+
+        if self.context:
+            await self.context.close()
+            self.context = None  # pyright: ignore
+
+        if self.playwright:
+            await self.playwright.stop()
+            self.playwright = None  # pyright: ignore
+
+        self._closed = True
+
+    async def __aenter__(self):
+        await self.__create__()
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        await self.close()
+
     async def _get_page(
         self,
         timeout: int | float,
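`_create_response_handler` exists because Playwright reports responses through event callbacks; writing the final main-frame document response into a caller-supplied one-element list lets the caller read it after navigation without `nonlocal` tricks. A standalone sketch of the same capture pattern built on plain Playwright (the `fetch_final_response` helper is ours, not Scrapling's actual call site):

```python
from playwright.sync_api import sync_playwright, Response


def fetch_final_response(url: str) -> Response | None:
    container: list[Response | None] = [None]  # mutable slot the callback fills

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()

        def handle_response(response: Response) -> None:
            request = response.request
            # Same filter as the diff: only the main frame's navigation document
            if (
                request.resource_type == "document"
                and request.is_navigation_request()
                and request.frame == page.main_frame
            ):
                container[0] = response

        page.on("response", handle_response)
        page.goto(url)
        browser.close()

    return container[0]
```

Redirect chains are why the filter matters: each hop emits its own response event, and the last write into `container[0]` is the document the page actually ended on.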
scrapling/engines/_browsers/_base.py, continued:

@@ -97,7 +193,6 @@ class AsyncSession:
                 f"No pages finished to clear place in the pool within the {self._max_wait_for_page}s timeout period"
             )
 
-        assert self.context is not None, "Browser context not initialized"
         page = await self.context.new_page()
         page.set_default_navigation_timeout(timeout)
         page.set_default_timeout(timeout)
@@ -121,6 +216,40 @@
             "max_pages": self.max_pages,
         }
 
+    @staticmethod
+    async def _wait_for_networkidle(page: AsyncPage | AsyncFrame, timeout: Optional[int] = None):
+        """Wait for the page to become idle (no network activity) even if there are never-ending requests."""
+        try:
+            await page.wait_for_load_state("networkidle", timeout=timeout)
+        except PlaywrightError:
+            pass
+
+    async def _wait_for_page_stability(self, page: AsyncPage | AsyncFrame, load_dom: bool, network_idle: bool):
+        await page.wait_for_load_state(state="load")
+        if load_dom:
+            await page.wait_for_load_state(state="domcontentloaded")
+        if network_idle:
+            await self._wait_for_networkidle(page)
+
+    @staticmethod
+    def _create_response_handler(page_info: PageInfo, response_container: List) -> Callable:
+        """Create an async response handler that captures the final navigation response.
+
+        :param page_info: The PageInfo object containing the page
+        :param response_container: A list to store the final response (mutable container)
+        :return: A callback function for page.on("response", ...)
+        """
+
+        async def handle_response(finished_response: AsyncPlaywrightResponse):
+            if (
+                finished_response.request.resource_type == "document"
+                and finished_response.request.is_navigation_request()
+                and finished_response.request.frame == page_info.page.main_frame
+            ):
+                response_container[0] = finished_response
+
+        return handle_response
+
 
 class DynamicSessionMixin:
     def __validate__(self, **params):
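Note how both the sync and async `_wait_for_networkidle` swallow Playwright's timeout error: pages with long-polling or streaming requests may never reach `networkidle`, and that should not fail the whole fetch. The idea in isolation (the `settle` helper name and default timeout are our own):

```python
from playwright._impl._errors import Error as PlaywrightError


async def settle(page, timeout_ms: float | None = 3_000) -> None:
    # Tolerate pages that never go network-idle instead of raising.
    try:
        await page.wait_for_load_state("networkidle", timeout=timeout_ms)
    except PlaywrightError:
        pass  # background traffic kept the network busy; proceed anyway
```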
scrapling/engines/_browsers/_base.py, continued:

@@ -147,6 +276,7 @@ class DynamicSessionMixin:
         self.wait_selector = config.wait_selector
         self.init_script = config.init_script
         self.wait_selector_state = config.wait_selector_state
+        self.extra_flags = config.extra_flags
         self.selector_config = config.selector_config
         self.additional_args = config.additional_args
         self.page_action = config.page_action
@@ -171,6 +301,7 @@ class DynamicSessionMixin:
                 self.stealth,
                 self.hide_canvas,
                 self.disable_webgl,
+                tuple(self.extra_flags) if self.extra_flags else tuple(),
             )
         )
         self.launch_options["extra_http_headers"] = dict(self.launch_options["extra_http_headers"])
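Together with the `scrapling/fetchers/chrome.py` change in the file list, this threads a new `extra_flags` option from the fetcher configuration into the browser launch arguments. A hypothetical call, inferred from this diff alone (the parameter placement is an assumption; check the 0.3.8 docs for the real signature):

```python
from scrapling.fetchers import DynamicFetcher

# Hypothetical: forward extra Chromium launch flags via the new option.
page = DynamicFetcher.fetch(
    "https://example.com",
    extra_flags=["--disable-gpu", "--lang=en-US"],
)
```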
|