scrapling 0.3.5__tar.gz → 0.3.6__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {scrapling-0.3.5/scrapling.egg-info → scrapling-0.3.6}/PKG-INFO +17 -15
- {scrapling-0.3.5 → scrapling-0.3.6}/README.md +14 -12
- {scrapling-0.3.5 → scrapling-0.3.6}/pyproject.toml +4 -6
- scrapling-0.3.6/scrapling/__init__.py +38 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/cli.py +21 -4
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/_types.py +0 -2
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/ai.py +22 -14
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/shell.py +2 -1
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/storage.py +2 -1
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/utils/__init__.py +0 -1
- scrapling-0.3.6/scrapling/engines/_browsers/__init__.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/_browsers/_base.py +9 -8
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/_browsers/_camoufox.py +36 -22
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/_browsers/_controllers.py +2 -2
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/constants.py +0 -15
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/static.py +419 -16
- scrapling-0.3.6/scrapling/fetchers/__init__.py +36 -0
- scrapling-0.3.6/scrapling/fetchers/chrome.py +205 -0
- scrapling-0.3.6/scrapling/fetchers/firefox.py +216 -0
- scrapling-0.3.6/scrapling/fetchers/requests.py +28 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/parser.py +4 -4
- {scrapling-0.3.5 → scrapling-0.3.6/scrapling.egg-info}/PKG-INFO +17 -15
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling.egg-info/SOURCES.txt +5 -2
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling.egg-info/requires.txt +2 -2
- {scrapling-0.3.5 → scrapling-0.3.6}/setup.cfg +1 -1
- scrapling-0.3.5/scrapling/__init__.py +0 -28
- scrapling-0.3.5/scrapling/engines/_browsers/__init__.py +0 -2
- scrapling-0.3.5/scrapling/fetchers.py +0 -444
- {scrapling-0.3.5 → scrapling-0.3.6}/LICENSE +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/MANIFEST.in +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/__init__.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/_html_utils.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/custom_types.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/mixins.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/translator.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/utils/_shell.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/core/utils/_utils.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/__init__.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/_browsers/_config_tools.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/_browsers/_page.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/_browsers/_validators.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/__init__.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/convertor.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/custom.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/fingerprints.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/engines/toolbelt/navigation.py +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling/py.typed +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling.egg-info/entry_points.txt +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.3.5 → scrapling-0.3.6}/scrapling.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
|
|
1
1
|
Metadata-Version: 2.4
|
2
2
|
Name: scrapling
|
3
|
-
Version: 0.3.5
|
3
|
+
Version: 0.3.6
|
4
4
|
Summary: Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!
|
5
5
|
Home-page: https://github.com/D4Vinci/Scrapling
|
6
6
|
Author: Karim Shoair
|
@@ -64,7 +64,7 @@ Classifier: Typing :: Typed
|
|
64
64
|
Requires-Python: >=3.10
|
65
65
|
Description-Content-Type: text/markdown
|
66
66
|
License-File: LICENSE
|
67
|
-
Requires-Dist: lxml>=6.0.
|
67
|
+
Requires-Dist: lxml>=6.0.2
|
68
68
|
Requires-Dist: cssselect>=1.3.0
|
69
69
|
Requires-Dist: orjson>=3.11.3
|
70
70
|
Requires-Dist: tldextract>=5.3.0
|
@@ -77,7 +77,7 @@ Requires-Dist: camoufox>=0.4.11; extra == "fetchers"
|
|
77
77
|
Requires-Dist: geoip2>=5.1.0; extra == "fetchers"
|
78
78
|
Requires-Dist: msgspec>=0.19.0; extra == "fetchers"
|
79
79
|
Provides-Extra: ai
|
80
|
-
Requires-Dist: mcp>=1.
|
80
|
+
Requires-Dist: mcp>=1.15.0; extra == "ai"
|
81
81
|
Requires-Dist: markdownify>=1.2.0; extra == "ai"
|
82
82
|
Requires-Dist: scrapling[fetchers]; extra == "ai"
|
83
83
|
Provides-Extra: shell
|
@@ -139,7 +139,7 @@ Dynamic: license-file
|
|
139
139
|
|
140
140
|
Scrapling isn't just another Web Scraping library. It's the first **adaptive** scraping library that learns from website changes and evolves with them. While other libraries break when websites update their structure, Scrapling automatically relocates your elements and keeps your scrapers running.
|
141
141
|
|
142
|
-
Built for the modern Web, Scrapling
|
142
|
+
Built for the modern Web, Scrapling features its own rapid parsing engine and fetchers to handle all Web Scraping challenges you face or will face. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
|
143
143
|
|
144
144
|
```python
|
145
145
|
>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
|
@@ -163,6 +163,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
|
|
163
163
|
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
164
164
|
<a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
|
165
165
|
<a href="https://www.nstproxy.com/?type=flow&utm_source=scrapling" target="_blank" title="One Proxy Service, Infinite Solutions at Unbeatable Prices!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/NSTproxy.png"></a>
|
166
|
+
<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
|
166
167
|
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
167
168
|
|
168
169
|
<!-- /sponsors -->
|
@@ -176,7 +177,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
|
|
176
177
|
### Advanced Websites Fetching with Session Support
|
177
178
|
- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP3.
|
178
179
|
- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium, real Chrome, and custom stealth mode.
|
179
|
-
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` using a modified version of Firefox and fingerprint spoofing. Can bypass all
|
180
|
+
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` using a modified version of Firefox and fingerprint spoofing. Can bypass all types of Cloudflare's Turnstile and Interstitial with automation easily.
|
180
181
|
- **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
|
181
182
|
- **Async Support**: Complete async support across all fetchers and dedicated async session classes.
|
182
183
|
|
@@ -200,13 +201,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
|
|
200
201
|
- 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
|
201
202
|
- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup with the same pseudo-elements used in Scrapy/Parsel.
|
202
203
|
- 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion.
|
203
|
-
|
204
|
-
### New Session Architecture
|
205
|
-
Scrapling 0.3 introduces a completely revamped session system:
|
206
|
-
- **Persistent Sessions**: Maintain cookies, headers, and authentication across multiple requests
|
207
|
-
- **Automatic Session Management**: Smart session lifecycle handling with proper cleanup
|
208
|
-
- **Session Inheritance**: All fetchers support both one-off requests and persistent session usage
|
209
|
-
- **Concurrent Session Support**: Run multiple isolated sessions simultaneously
|
204
|
+
- 🔋 **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.
|
210
205
|
|
211
206
|
## Getting Started
|
212
207
|
|
@@ -324,11 +319,11 @@ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.
|
|
324
319
|
```
|
325
320
|
|
326
321
|
> [!NOTE]
|
327
|
-
> There are many additional features, but we want to keep this page
|
322
|
+
> There are many additional features, but we want to keep this page concise, such as the MCP server and the interactive Web Scraping Shell. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
|
328
323
|
|
329
324
|
## Performance Benchmarks
|
330
325
|
|
331
|
-
Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3
|
326
|
+
Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3 have delivered exceptional performance improvements across all operations.
|
332
327
|
|
333
328
|
### Text Extraction Speed Test (5000 nested elements)
|
334
329
|
|
@@ -391,6 +386,13 @@ Starting with v0.3.2, this installation only includes the parser engine and its
|
|
391
386
|
```
|
392
387
|
Don't forget that you need to install the browser dependencies with `scrapling install` after any of these extras (if you didn't already)
|
393
388
|
|
389
|
+
### Docker
|
390
|
+
You can also install a Docker image with all extras and browsers with the following command:
|
391
|
+
```bash
|
392
|
+
docker pull scrapling
|
393
|
+
```
|
394
|
+
This image is automatically built and pushed to Docker Hub through GitHub actions right here.
|
395
|
+
|
394
396
|
## Contributing
|
395
397
|
|
396
398
|
We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.
|
@@ -398,7 +400,7 @@ We welcome contributions! Please read our [contributing guidelines](https://gith
|
|
398
400
|
## Disclaimer
|
399
401
|
|
400
402
|
> [!CAUTION]
|
401
|
-
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect
|
403
|
+
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.
|
402
404
|
|
403
405
|
## License
|
404
406
|
|
@@ -49,7 +49,7 @@
|
|
49
49
|
|
50
50
|
Scrapling isn't just another Web Scraping library. It's the first **adaptive** scraping library that learns from website changes and evolves with them. While other libraries break when websites update their structure, Scrapling automatically relocates your elements and keeps your scrapers running.
|
51
51
|
|
52
|
-
Built for the modern Web, Scrapling
|
52
|
+
Built for the modern Web, Scrapling features its own rapid parsing engine and fetchers to handle all Web Scraping challenges you face or will face. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
|
53
53
|
|
54
54
|
```python
|
55
55
|
>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
|
@@ -73,6 +73,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
|
|
73
73
|
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
74
74
|
<a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
|
75
75
|
<a href="https://www.nstproxy.com/?type=flow&utm_source=scrapling" target="_blank" title="One Proxy Service, Infinite Solutions at Unbeatable Prices!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/NSTproxy.png"></a>
|
76
|
+
<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
|
76
77
|
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
77
78
|
|
78
79
|
<!-- /sponsors -->
|
@@ -86,7 +87,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
|
|
86
87
|
### Advanced Websites Fetching with Session Support
|
87
88
|
- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP3.
|
88
89
|
- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium, real Chrome, and custom stealth mode.
|
89
|
-
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` using a modified version of Firefox and fingerprint spoofing. Can bypass all
|
90
|
+
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` using a modified version of Firefox and fingerprint spoofing. Can bypass all types of Cloudflare's Turnstile and Interstitial with automation easily.
|
90
91
|
- **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
|
91
92
|
- **Async Support**: Complete async support across all fetchers and dedicated async session classes.
|
92
93
|
|
@@ -110,13 +111,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
|
|
110
111
|
- 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
|
111
112
|
- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup with the same pseudo-elements used in Scrapy/Parsel.
|
112
113
|
- 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion.
|
113
|
-
|
114
|
-
### New Session Architecture
|
115
|
-
Scrapling 0.3 introduces a completely revamped session system:
|
116
|
-
- **Persistent Sessions**: Maintain cookies, headers, and authentication across multiple requests
|
117
|
-
- **Automatic Session Management**: Smart session lifecycle handling with proper cleanup
|
118
|
-
- **Session Inheritance**: All fetchers support both one-off requests and persistent session usage
|
119
|
-
- **Concurrent Session Support**: Run multiple isolated sessions simultaneously
|
114
|
+
- 🔋 **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.
|
120
115
|
|
121
116
|
## Getting Started
|
122
117
|
|
@@ -234,11 +229,11 @@ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.
|
|
234
229
|
```
|
235
230
|
|
236
231
|
> [!NOTE]
|
237
|
-
> There are many additional features, but we want to keep this page
|
232
|
+
> There are many additional features, but we want to keep this page concise, such as the MCP server and the interactive Web Scraping Shell. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
|
238
233
|
|
239
234
|
## Performance Benchmarks
|
240
235
|
|
241
|
-
Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3
|
236
|
+
Scrapling isn't just powerful—it's also blazing fast, and the updates since version 0.3 have delivered exceptional performance improvements across all operations.
|
242
237
|
|
243
238
|
### Text Extraction Speed Test (5000 nested elements)
|
244
239
|
|
@@ -301,6 +296,13 @@ Starting with v0.3.2, this installation only includes the parser engine and its
|
|
301
296
|
```
|
302
297
|
Don't forget that you need to install the browser dependencies with `scrapling install` after any of these extras (if you didn't already)
|
303
298
|
|
299
|
+
### Docker
|
300
|
+
You can also install a Docker image with all extras and browsers with the following command:
|
301
|
+
```bash
|
302
|
+
docker pull scrapling
|
303
|
+
```
|
304
|
+
This image is automatically built and pushed to Docker Hub through GitHub actions right here.
|
305
|
+
|
304
306
|
## Contributing
|
305
307
|
|
306
308
|
We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.
|
@@ -308,7 +310,7 @@ We welcome contributions! Please read our [contributing guidelines](https://gith
|
|
308
310
|
## Disclaimer
|
309
311
|
|
310
312
|
> [!CAUTION]
|
311
|
-
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect
|
313
|
+
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.
|
312
314
|
|
313
315
|
## License
|
314
316
|
|
@@ -4,7 +4,8 @@ build-backend = "setuptools.build_meta"
|
|
4
4
|
|
5
5
|
[project]
|
6
6
|
name = "scrapling"
|
7
|
-
dynamic = ["version"]
|
7
|
+
# Static version instead of dynamic version so we can get better layer caching while building docker, check the docker file to understand
|
8
|
+
version = "0.3.6"
|
8
9
|
description = "Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!"
|
9
10
|
readme = {file = "README.md", content-type = "text/markdown"}
|
10
11
|
license = {file = "LICENSE"}
|
@@ -56,7 +57,7 @@ classifiers = [
|
|
56
57
|
"Typing :: Typed",
|
57
58
|
]
|
58
59
|
dependencies = [
|
59
|
-
"lxml>=6.0.
|
60
|
+
"lxml>=6.0.2",
|
60
61
|
"cssselect>=1.3.0",
|
61
62
|
"orjson>=3.11.3",
|
62
63
|
"tldextract>=5.3.0",
|
@@ -73,7 +74,7 @@ fetchers = [
|
|
73
74
|
"msgspec>=0.19.0",
|
74
75
|
]
|
75
76
|
ai = [
|
76
|
-
"mcp>=1.
|
77
|
+
"mcp>=1.15.0",
|
77
78
|
"markdownify>=1.2.0",
|
78
79
|
"scrapling[fetchers]",
|
79
80
|
]
|
@@ -99,9 +100,6 @@ scrapling = "scrapling.cli:main"
|
|
99
100
|
zip-safe = false
|
100
101
|
include-package-data = true
|
101
102
|
|
102
|
-
[tool.setuptools.dynamic]
|
103
|
-
version = {attr = "scrapling.__version__"}
|
104
|
-
|
105
103
|
[tool.setuptools.packages.find]
|
106
104
|
where = ["."]
|
107
105
|
include = ["scrapling*"]
|
@@ -0,0 +1,38 @@
|
|
1
|
+
__author__ = "Karim Shoair (karim.shoair@pm.me)"
|
2
|
+
__version__ = "0.3.6"
|
3
|
+
__copyright__ = "Copyright (c) 2024 Karim Shoair"
|
4
|
+
|
5
|
+
from typing import Any, TYPE_CHECKING
|
6
|
+
|
7
|
+
if TYPE_CHECKING:
|
8
|
+
from scrapling.parser import Selector, Selectors
|
9
|
+
from scrapling.core.custom_types import AttributesHandler, TextHandler
|
10
|
+
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
|
11
|
+
|
12
|
+
|
13
|
+
# Lazy import mapping
|
14
|
+
_LAZY_IMPORTS = {
|
15
|
+
"Fetcher": ("scrapling.fetchers", "Fetcher"),
|
16
|
+
"Selector": ("scrapling.parser", "Selector"),
|
17
|
+
"Selectors": ("scrapling.parser", "Selectors"),
|
18
|
+
"AttributesHandler": ("scrapling.core.custom_types", "AttributesHandler"),
|
19
|
+
"TextHandler": ("scrapling.core.custom_types", "TextHandler"),
|
20
|
+
"AsyncFetcher": ("scrapling.fetchers", "AsyncFetcher"),
|
21
|
+
"StealthyFetcher": ("scrapling.fetchers", "StealthyFetcher"),
|
22
|
+
"DynamicFetcher": ("scrapling.fetchers", "DynamicFetcher"),
|
23
|
+
}
|
24
|
+
__all__ = ["Selector", "Fetcher", "AsyncFetcher", "StealthyFetcher", "DynamicFetcher"]
|
25
|
+
|
26
|
+
|
27
|
+
def __getattr__(name: str) -> Any:
|
28
|
+
if name in _LAZY_IMPORTS:
|
29
|
+
module_path, class_name = _LAZY_IMPORTS[name]
|
30
|
+
module = __import__(module_path, fromlist=[class_name])
|
31
|
+
return getattr(module, class_name)
|
32
|
+
else:
|
33
|
+
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
|
34
|
+
|
35
|
+
|
36
|
+
def __dir__() -> list[str]:
|
37
|
+
"""Support for dir() and autocomplete."""
|
38
|
+
return sorted(__all__ + ["fetchers", "parser", "cli", "core", "__author__", "__version__", "__copyright__"])
|
@@ -2,8 +2,9 @@ from pathlib import Path
|
|
2
2
|
from subprocess import check_output
|
3
3
|
from sys import executable as python_executable
|
4
4
|
|
5
|
+
from scrapling.core.utils import log
|
5
6
|
from scrapling.engines.toolbelt.custom import Response
|
6
|
-
from scrapling.core.utils import
|
7
|
+
from scrapling.core.utils._shell import _CookieParser, _ParseHeaders
|
7
8
|
from scrapling.core._types import List, Optional, Dict, Tuple, Any, Callable
|
8
9
|
|
9
10
|
from orjson import loads as json_loads, JSONDecodeError
|
@@ -135,10 +136,26 @@ def install(force): # pragma: no cover
|
|
135
136
|
|
136
137
|
|
137
138
|
@command(help="Run Scrapling's MCP server (Check the docs for more info).")
|
138
|
-
|
139
|
+
@option(
|
140
|
+
"--http",
|
141
|
+
is_flag=True,
|
142
|
+
default=False,
|
143
|
+
help="Whether to run the MCP server in streamable-http transport or leave it as stdio (Default: False)",
|
144
|
+
)
|
145
|
+
@option(
|
146
|
+
"--host",
|
147
|
+
type=str,
|
148
|
+
default="0.0.0.0",
|
149
|
+
help="The host to use if streamable-http transport is enabled (Default: '0.0.0.0')",
|
150
|
+
)
|
151
|
+
@option(
|
152
|
+
"--port", type=int, default=8000, help="The port to use if streamable-http transport is enabled (Default: 8000)"
|
153
|
+
)
|
154
|
+
def mcp(http, host, port):
|
139
155
|
from scrapling.core.ai import ScraplingMCPServer
|
140
156
|
|
141
|
-
ScraplingMCPServer()
|
157
|
+
server = ScraplingMCPServer()
|
158
|
+
server.serve(http, host, port)
|
142
159
|
|
143
160
|
|
144
161
|
@command(help="Interactive scraping console")
|
@@ -766,7 +783,7 @@ def stealthy_fetch(
|
|
766
783
|
:param disable_resources: Drop requests of unnecessary resources for a speed boost.
|
767
784
|
:param block_webrtc: Blocks WebRTC entirely.
|
768
785
|
:param humanize: Humanize the cursor movement.
|
769
|
-
:param solve_cloudflare: Solves all
|
786
|
+
:param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges.
|
770
787
|
:param allow_webgl: Allow WebGL (recommended to keep enabled).
|
771
788
|
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
|
772
789
|
:param disable_ads: Install the uBlock Origin addon on the browser.
|
@@ -42,10 +42,7 @@ def _ContentTranslator(content: Generator[str, None, None], page: _ScraplingResp
|
|
42
42
|
|
43
43
|
|
44
44
|
class ScraplingMCPServer:
|
45
|
-
_server = FastMCP(name="Scrapling")
|
46
|
-
|
47
45
|
@staticmethod
|
48
|
-
@_server.tool()
|
49
46
|
def get(
|
50
47
|
url: str,
|
51
48
|
impersonate: Optional[BrowserTypeLiteral] = "chrome",
|
@@ -124,7 +121,6 @@ class ScraplingMCPServer:
|
|
124
121
|
)
|
125
122
|
|
126
123
|
@staticmethod
|
127
|
-
@_server.tool()
|
128
124
|
async def bulk_get(
|
129
125
|
urls: Tuple[str, ...],
|
130
126
|
impersonate: Optional[BrowserTypeLiteral] = "chrome",
|
@@ -211,7 +207,6 @@ class ScraplingMCPServer:
|
|
211
207
|
]
|
212
208
|
|
213
209
|
@staticmethod
|
214
|
-
@_server.tool()
|
215
210
|
async def fetch(
|
216
211
|
url: str,
|
217
212
|
extraction_type: extraction_types = "markdown",
|
@@ -263,7 +258,7 @@ class ScraplingMCPServer:
|
|
263
258
|
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
264
259
|
:param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
|
265
260
|
:param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
|
266
|
-
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers
|
261
|
+
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
267
262
|
:param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search of this website's domain name.
|
268
263
|
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
|
269
264
|
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
@@ -300,7 +295,6 @@ class ScraplingMCPServer:
|
|
300
295
|
)
|
301
296
|
|
302
297
|
@staticmethod
|
303
|
-
@_server.tool()
|
304
298
|
async def bulk_fetch(
|
305
299
|
urls: Tuple[str, ...],
|
306
300
|
extraction_type: extraction_types = "markdown",
|
@@ -352,7 +346,7 @@ class ScraplingMCPServer:
|
|
352
346
|
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
353
347
|
:param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
|
354
348
|
:param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
|
355
|
-
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers
|
349
|
+
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
356
350
|
:param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search of this website's domain name.
|
357
351
|
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
|
358
352
|
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
@@ -394,7 +388,6 @@ class ScraplingMCPServer:
|
|
394
388
|
]
|
395
389
|
|
396
390
|
@staticmethod
|
397
|
-
@_server.tool()
|
398
391
|
async def stealthy_fetch(
|
399
392
|
url: str,
|
400
393
|
extraction_type: extraction_types = "markdown",
|
@@ -443,7 +436,7 @@ class ScraplingMCPServer:
|
|
443
436
|
:param cookies: Set cookies for the next request.
|
444
437
|
:param addons: List of Firefox addons to use. Must be paths to extracted addons.
|
445
438
|
:param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
|
446
|
-
:param solve_cloudflare: Solves all
|
439
|
+
:param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
|
447
440
|
:param allow_webgl: Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
448
441
|
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
|
449
442
|
:param disable_ads: Disabled by default, this installs the `uBlock Origin` addon on the browser if enabled.
|
@@ -494,7 +487,6 @@ class ScraplingMCPServer:
|
|
494
487
|
)
|
495
488
|
|
496
489
|
@staticmethod
|
497
|
-
@_server.tool()
|
498
490
|
async def bulk_stealthy_fetch(
|
499
491
|
urls: Tuple[str, ...],
|
500
492
|
extraction_type: extraction_types = "markdown",
|
@@ -543,7 +535,7 @@ class ScraplingMCPServer:
|
|
543
535
|
:param cookies: Set cookies for the next request.
|
544
536
|
:param addons: List of Firefox addons to use. Must be paths to extracted addons.
|
545
537
|
:param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
|
546
|
-
:param solve_cloudflare: Solves all
|
538
|
+
:param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
|
547
539
|
:param allow_webgl: Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
548
540
|
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
|
549
541
|
:param disable_ads: Disabled by default, this installs the `uBlock Origin` addon on the browser if enabled.
|
@@ -598,6 +590,22 @@ class ScraplingMCPServer:
|
|
598
590
|
for page in responses
|
599
591
|
]
|
600
592
|
|
601
|
-
def serve(self):
|
593
|
+
def serve(self, http: bool, host: str, port: int):
|
602
594
|
"""Serve the MCP server."""
|
603
|
-
|
595
|
+
server = FastMCP(name="Scrapling", host=host, port=port)
|
596
|
+
server.add_tool(self.get, title="get", description=self.get.__doc__, structured_output=True)
|
597
|
+
server.add_tool(self.bulk_get, title="bulk_get", description=self.bulk_get.__doc__, structured_output=True)
|
598
|
+
server.add_tool(self.fetch, title="fetch", description=self.fetch.__doc__, structured_output=True)
|
599
|
+
server.add_tool(
|
600
|
+
self.bulk_fetch, title="bulk_fetch", description=self.bulk_fetch.__doc__, structured_output=True
|
601
|
+
)
|
602
|
+
server.add_tool(
|
603
|
+
self.stealthy_fetch, title="stealthy_fetch", description=self.stealthy_fetch.__doc__, structured_output=True
|
604
|
+
)
|
605
|
+
server.add_tool(
|
606
|
+
self.bulk_stealthy_fetch,
|
607
|
+
title="bulk_stealthy_fetch",
|
608
|
+
description=self.bulk_stealthy_fetch.__doc__,
|
609
|
+
structured_output=True,
|
610
|
+
)
|
611
|
+
server.run(transport="stdio" if not http else "streamable-http")
|
@@ -22,10 +22,11 @@ from logging import (
|
|
22
22
|
from orjson import loads as json_loads, JSONDecodeError
|
23
23
|
|
24
24
|
from scrapling import __version__
|
25
|
+
from scrapling.core.utils import log
|
25
26
|
from scrapling.parser import Selector, Selectors
|
26
27
|
from scrapling.core.custom_types import TextHandler
|
27
28
|
from scrapling.engines.toolbelt.custom import Response
|
28
|
-
from scrapling.core.utils import
|
29
|
+
from scrapling.core.utils._shell import _ParseHeaders, _CookieParser
|
29
30
|
from scrapling.core._types import (
|
30
31
|
Optional,
|
31
32
|
Dict,
|
@@ -6,7 +6,6 @@ from sqlite3 import connect as db_connect
|
|
6
6
|
|
7
7
|
from orjson import dumps, loads
|
8
8
|
from lxml.html import HtmlElement
|
9
|
-
from tldextract import extract as tld
|
10
9
|
|
11
10
|
from scrapling.core.utils import _StorageTools, log
|
12
11
|
from scrapling.core._types import Dict, Optional, Any
|
@@ -26,6 +25,8 @@ class StorageSystemMixin(ABC): # pragma: no cover
|
|
26
25
|
return default_value
|
27
26
|
|
28
27
|
try:
|
28
|
+
from tldextract import extract as tld
|
29
|
+
|
29
30
|
extracted = tld(self.url)
|
30
31
|
return extracted.top_domain_under_public_suffix or extracted.domain or default_value
|
31
32
|
except AttributeError:
|
File without changes
|
@@ -12,17 +12,13 @@ from camoufox.utils import (
|
|
12
12
|
installed_verstr as camoufox_version,
|
13
13
|
)
|
14
14
|
|
15
|
-
from scrapling.engines.toolbelt.navigation import intercept_route, async_intercept_route
|
16
|
-
from scrapling.core._types import (
|
17
|
-
Any,
|
18
|
-
Dict,
|
19
|
-
Optional,
|
20
|
-
)
|
21
15
|
from ._page import PageInfo, PagePool
|
22
|
-
from .
|
23
|
-
from .
|
16
|
+
from scrapling.parser import Selector
|
17
|
+
from scrapling.core._types import Dict, Optional
|
24
18
|
from scrapling.engines.toolbelt.fingerprints import get_os_name
|
25
19
|
from ._validators import validate, PlaywrightConfig, CamoufoxConfig
|
20
|
+
from ._config_tools import _compiled_stealth_scripts, _launch_kwargs, _context_kwargs
|
21
|
+
from scrapling.engines.toolbelt.navigation import intercept_route, async_intercept_route
|
26
22
|
|
27
23
|
__ff_version_str__ = camoufox_version().split(".", 1)[0]
|
28
24
|
|
@@ -268,4 +264,9 @@ class StealthySessionMixin:
|
|
268
264
|
if f"cType: '{ctype}'" in page_content:
|
269
265
|
return ctype
|
270
266
|
|
267
|
+
# Check if turnstile captcha is embedded inside the page (Usually inside a closed Shadow iframe)
|
268
|
+
selector = Selector(content=page_content)
|
269
|
+
if selector.css('script[src*="challenges.cloudflare.com/turnstile/v"]'):
|
270
|
+
return "embedded"
|
271
|
+
|
271
272
|
return None
|
@@ -116,7 +116,7 @@ class StealthySession(StealthySessionMixin, SyncSession):
|
|
116
116
|
:param cookies: Set cookies for the next request.
|
117
117
|
:param addons: List of Firefox addons to use. Must be paths to extracted addons.
|
118
118
|
:param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
|
119
|
-
:param solve_cloudflare: Solves all
|
119
|
+
:param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
|
120
120
|
:param allow_webgl: Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
121
121
|
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
|
122
122
|
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
@@ -237,26 +237,33 @@ class StealthySession(StealthySessionMixin, SyncSession):
|
|
237
237
|
return
|
238
238
|
|
239
239
|
else:
|
240
|
-
|
241
|
-
|
242
|
-
|
240
|
+
box_selector = "#cf_turnstile div, #cf-turnstile div, .turnstile>div>div"
|
241
|
+
if challenge_type != "embedded":
|
242
|
+
box_selector = ".main-content p+div>div>div"
|
243
|
+
while "Verifying you are human." in self._get_page_content(page):
|
244
|
+
# Waiting for the verify spinner to disappear, checking every 1s if it disappeared
|
245
|
+
page.wait_for_timeout(500)
|
243
246
|
|
244
247
|
iframe = page.frame(url=__CF_PATTERN__)
|
245
248
|
if iframe is None:
|
246
|
-
log.
|
249
|
+
log.error("Didn't find Cloudflare iframe!")
|
247
250
|
return
|
248
251
|
|
249
|
-
|
250
|
-
|
251
|
-
|
252
|
+
if challenge_type != "embedded":
|
253
|
+
while not iframe.frame_element().is_visible():
|
254
|
+
# Double-checking that the iframe is loaded
|
255
|
+
page.wait_for_timeout(500)
|
252
256
|
|
257
|
+
iframe.wait_for_load_state(state="domcontentloaded")
|
258
|
+
iframe.wait_for_load_state("networkidle")
|
253
259
|
# Calculate the Captcha coordinates for any viewport
|
254
|
-
outer_box = page.locator(
|
260
|
+
outer_box = page.locator(box_selector).last.bounding_box()
|
255
261
|
captcha_x, captcha_y = outer_box["x"] + 26, outer_box["y"] + 25
|
256
262
|
|
257
263
|
# Move the mouse to the center of the window, then press and hold the left mouse button
|
258
264
|
page.mouse.click(captcha_x, captcha_y, delay=60, button="left")
|
259
|
-
|
265
|
+
if challenge_type != "embedded":
|
266
|
+
page.locator(".zone-name-title").wait_for(state="hidden")
|
260
267
|
page.wait_for_load_state(state="domcontentloaded")
|
261
268
|
|
262
269
|
log.info("Cloudflare captcha is solved")
|
@@ -293,7 +300,7 @@ class StealthySession(StealthySessionMixin, SyncSession):
|
|
293
300
|
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
294
301
|
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
|
295
302
|
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
296
|
-
:param solve_cloudflare: Solves all
|
303
|
+
:param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
|
297
304
|
:param selector_config: The arguments that will be passed in the end while creating the final Selector's class.
|
298
305
|
:return: A `Response` object.
|
299
306
|
"""
|
@@ -435,7 +442,7 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
|
|
435
442
|
:param cookies: Set cookies for the next request.
|
436
443
|
:param addons: List of Firefox addons to use. Must be paths to extracted addons.
|
437
444
|
:param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
|
438
|
-
:param solve_cloudflare: Solves all
|
445
|
+
:param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
|
439
446
|
:param allow_webgl: Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
440
447
|
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
|
441
448
|
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
@@ -556,26 +563,33 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
|
|
556
563
|
return
|
557
564
|
|
558
565
|
else:
|
559
|
-
|
560
|
-
|
561
|
-
|
566
|
+
box_selector = "#cf_turnstile div, #cf-turnstile div, .turnstile>div>div"
|
567
|
+
if challenge_type != "embedded":
|
568
|
+
box_selector = ".main-content p+div>div>div"
|
569
|
+
while "Verifying you are human." in (await self._get_page_content(page)):
|
570
|
+
# Waiting for the verify spinner to disappear, checking every 1s if it disappeared
|
571
|
+
await page.wait_for_timeout(500)
|
562
572
|
|
563
573
|
iframe = page.frame(url=__CF_PATTERN__)
|
564
574
|
if iframe is None:
|
565
|
-
log.
|
575
|
+
log.error("Didn't find Cloudflare iframe!")
|
566
576
|
return
|
567
577
|
|
568
|
-
|
569
|
-
|
570
|
-
|
578
|
+
if challenge_type != "embedded":
|
579
|
+
while not await (await iframe.frame_element()).is_visible():
|
580
|
+
# Double-checking that the iframe is loaded
|
581
|
+
await page.wait_for_timeout(500)
|
571
582
|
|
583
|
+
await iframe.wait_for_load_state(state="domcontentloaded")
|
584
|
+
await iframe.wait_for_load_state("networkidle")
|
572
585
|
# Calculate the Captcha coordinates for any viewport
|
573
|
-
outer_box = await page.locator(
|
586
|
+
outer_box = await page.locator(box_selector).last.bounding_box()
|
574
587
|
captcha_x, captcha_y = outer_box["x"] + 26, outer_box["y"] + 25
|
575
588
|
|
576
589
|
# Move the mouse to the center of the window, then press and hold the left mouse button
|
577
590
|
await page.mouse.click(captcha_x, captcha_y, delay=60, button="left")
|
578
|
-
|
591
|
+
if challenge_type != "embedded":
|
592
|
+
await page.locator(".zone-name-title").wait_for(state="hidden")
|
579
593
|
await page.wait_for_load_state(state="domcontentloaded")
|
580
594
|
|
581
595
|
log.info("Cloudflare captcha is solved")
|
@@ -612,7 +626,7 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
|
|
612
626
|
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
613
627
|
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
|
614
628
|
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
615
|
-
:param solve_cloudflare: Solves all
|
629
|
+
:param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
|
616
630
|
:param selector_config: The arguments that will be passed in the end while creating the final Selector's class.
|
617
631
|
:return: A `Response` object.
|
618
632
|
"""
|