scrapling-0.2.96.tar.gz → scrapling-0.2.98.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {scrapling-0.2.96/scrapling.egg-info → scrapling-0.2.98}/PKG-INFO +23 -22
- {scrapling-0.2.96 → scrapling-0.2.98}/README.md +22 -21
- scrapling-0.2.98/scrapling/__init__.py +41 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/core/custom_types.py +1 -3
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/core/storage_adaptors.py +3 -3
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/core/translator.py +4 -1
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/core/utils.py +1 -1
- scrapling-0.2.98/scrapling/defaults.py +19 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/camo.py +123 -104
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/pw.py +100 -75
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/static.py +22 -42
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/custom.py +2 -2
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/fingerprints.py +2 -2
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/navigation.py +1 -1
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/fetchers.py +24 -24
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/parser.py +6 -12
- {scrapling-0.2.96 → scrapling-0.2.98/scrapling.egg-info}/PKG-INFO +23 -22
- {scrapling-0.2.96 → scrapling-0.2.98}/setup.cfg +1 -1
- {scrapling-0.2.96 → scrapling-0.2.98}/setup.py +1 -1
- scrapling-0.2.96/scrapling/__init__.py +0 -12
- scrapling-0.2.96/scrapling/defaults.py +0 -10
- {scrapling-0.2.96 → scrapling-0.2.98}/LICENSE +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/MANIFEST.in +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/cli.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/core/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/core/_types.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/core/mixins.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/constants.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/bypasses/pdf_viewer.js +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling/py.typed +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling.egg-info/SOURCES.txt +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling.egg-info/entry_points.txt +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling.egg-info/requires.txt +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/scrapling.egg-info/top_level.txt +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/async/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/async/test_camoufox.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/async/test_httpx.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/async/test_playwright.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/sync/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/sync/test_camoufox.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/sync/test_httpx.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/sync/test_playwright.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/fetchers/test_utils.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/parser/__init__.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/parser/test_automatch.py +0 -0
- {scrapling-0.2.96 → scrapling-0.2.98}/tests/parser/test_general.py +0 -0
````diff
--- scrapling-0.2.96/scrapling.egg-info/PKG-INFO
+++ scrapling-0.2.98/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: scrapling
-Version: 0.2.96
+Version: 0.2.98
 Summary: Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy again! In an internet filled with complications,
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
@@ -73,6 +73,22 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 
 # Sponsors
 
+[Scrapeless Deep SerpApi](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=website&utm_medium=ads&utm_campaign=scraping&utm_term=d4vinci) From $0.10 per 1,000 queries with a 1-2 second response time!
+
+Deep SerpApi is a dedicated search engine designed for large language models (LLMs) and AI agents, aiming to provide real-time, accurate and unbiased information to help AI applications retrieve and process data efficiently.
+- covering 20+ Google SERP scenarios and mainstream search engines.
+- support real-time data updates to ensure real-time and accurate information.
+- It can integrate information from all available online channels and search engines.
+- Deep SerpApi will simplify the process of integrating dynamic web information into AI solutions, and ultimately achieve an ALL-in-One API for one-click search and extraction of web data.
+- **Developer Support Program**: Integrate Scrapeless Deep SerpApi into your AI tools, applications or projects. [We already support Dify, and will soon support frameworks such as Langchain, Langflow, FlowiseAI]. Then share your results on GitHub or social media, and you will get a 1-12 month free developer support opportunity, up to 500 free usage per month.
+- 🚀 **Scraping API**: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
+- ⚡ **Scraping Browser**: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
+- 🌐 **Proxies**: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
+
+
+[](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=website&utm_medium=ads&utm_campaign=scraping&utm_term=d4vinci)
+---
+
 [Evomi](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling) is your Swiss Quality Proxy Provider, starting at **$0.49/GB**
 
 - 👩‍💻 **$0.49 per GB Residential Proxies**: Our price is unbeatable
@@ -88,21 +104,6 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 [](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling)
 ---
 
-[Scrapeless](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci) is your all-in-one web scraping toolkit, starting at just $0.60 per 1k URLs!
-
-- 🚀 Scraping API: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
-- ⚡ Scraping Browser: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
-- 🔒 Web Unlocker: Bypass CAPTCHAs, IP blocks, and dynamic content in real time, ensuring uninterrupted access.
-- 🌐 Proxies: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
-- 💼 Enterprise-Grade: Custom solutions for large-scale and complex data needs.
-- 🎁 Free Trial: Try before you buy—experience our service firsthand.
-- 💬 Pay-Per-Use: Flexible, cost-effective pricing with no long-term commitments.
-- 🔧 Easy Integration: Seamlessly integrate with your existing tools and workflows for hassle-free automation.
-
-
-[](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci)
----
-
 ## Table of content
 * [Key Features](#key-features)
 * [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support)
@@ -172,7 +173,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 ## Getting Started
 
 ```python
-from scrapling import Fetcher
+from scrapling.fetchers import Fetcher
 
 fetcher = Fetcher(auto_match=False)
 
@@ -254,7 +255,7 @@ Fetchers are interfaces built on top of other libraries with added features that
 ### Features
 You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way
 ```python
-from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
+from scrapling.fetchers import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
 All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class.
 
@@ -286,7 +287,7 @@ You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods i
 ```
 For Async requests, you will just replace the import like below:
 ```python
->> from scrapling import AsyncFetcher
+>> from scrapling.fetchers import AsyncFetcher
 >> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
 >> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
 >> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
@@ -540,7 +541,7 @@ When website owners implement structural changes like
 The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
 
 ```python
-from scrapling import Adaptor
+from scrapling.parser import Adaptor
 # Before the change
 page = Adaptor(page_source, url='example.com')
 element = page.css('#p1' auto_save=True)
@@ -558,7 +559,7 @@ To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayba
 If I want to extract the Questions button from the old design I can use a selector like this `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a` This selector is too specific because it was generated by Google Chrome.
 Now let's test the same selector in both versions
 ```python
->> from scrapling import Fetcher
+>> from scrapling.fetchers import Fetcher
 >> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
 >> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
 >> new_url = "https://stackoverflow.com/"
@@ -619,7 +620,7 @@ Note: The filtering process always starts from the first filter it finds in the
 Examples to clear any confusion :)
 
 ```python
->> from scrapling import Fetcher
+>> page = Fetcher().get('https://quotes.toscrape.com/')
 # Find all elements with tag name `div`.
 >> page.find_all('div')
````
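Every documentation hunk above (and mirrored in README.md below) migrates example imports from the package root to the concrete submodules `scrapling.fetchers` and `scrapling.parser`. A minimal sketch of the two import styles side by side, assuming scrapling 0.2.98 is installed (the CSS selector is illustrative, not from the diff):

```python
# New, explicit import path used throughout the 0.2.98 docs
from scrapling.fetchers import Fetcher

# The old top-level import keeps working because the new scrapling/__init__.py
# (shown further below) resolves these names lazily on first access
from scrapling import Fetcher as LegacyFetcher

page = Fetcher(auto_match=False).get('https://quotes.toscrape.com/')
print(page.css('.quote .text::text'))  # illustrative selector
```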
````diff
--- scrapling-0.2.96/README.md
+++ scrapling-0.2.98/README.md
@@ -18,6 +18,22 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 
 # Sponsors
 
+[Scrapeless Deep SerpApi](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=website&utm_medium=ads&utm_campaign=scraping&utm_term=d4vinci) From $0.10 per 1,000 queries with a 1-2 second response time!
+
+Deep SerpApi is a dedicated search engine designed for large language models (LLMs) and AI agents, aiming to provide real-time, accurate and unbiased information to help AI applications retrieve and process data efficiently.
+- covering 20+ Google SERP scenarios and mainstream search engines.
+- support real-time data updates to ensure real-time and accurate information.
+- It can integrate information from all available online channels and search engines.
+- Deep SerpApi will simplify the process of integrating dynamic web information into AI solutions, and ultimately achieve an ALL-in-One API for one-click search and extraction of web data.
+- **Developer Support Program**: Integrate Scrapeless Deep SerpApi into your AI tools, applications or projects. [We already support Dify, and will soon support frameworks such as Langchain, Langflow, FlowiseAI]. Then share your results on GitHub or social media, and you will get a 1-12 month free developer support opportunity, up to 500 free usage per month.
+- 🚀 **Scraping API**: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
+- ⚡ **Scraping Browser**: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
+- 🌐 **Proxies**: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
+
+
+[](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=website&utm_medium=ads&utm_campaign=scraping&utm_term=d4vinci)
+---
+
 [Evomi](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling) is your Swiss Quality Proxy Provider, starting at **$0.49/GB**
 
 - 👩‍💻 **$0.49 per GB Residential Proxies**: Our price is unbeatable
@@ -33,21 +49,6 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 [](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling)
 ---
 
-[Scrapeless](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci) is your all-in-one web scraping toolkit, starting at just $0.60 per 1k URLs!
-
-- 🚀 Scraping API: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
-- ⚡ Scraping Browser: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
-- 🔒 Web Unlocker: Bypass CAPTCHAs, IP blocks, and dynamic content in real time, ensuring uninterrupted access.
-- 🌐 Proxies: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
-- 💼 Enterprise-Grade: Custom solutions for large-scale and complex data needs.
-- 🎁 Free Trial: Try before you buy—experience our service firsthand.
-- 💬 Pay-Per-Use: Flexible, cost-effective pricing with no long-term commitments.
-- 🔧 Easy Integration: Seamlessly integrate with your existing tools and workflows for hassle-free automation.
-
-
-[](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci)
----
-
 ## Table of content
 * [Key Features](#key-features)
 * [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support)
@@ -117,7 +118,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 ## Getting Started
 
 ```python
-from scrapling import Fetcher
+from scrapling.fetchers import Fetcher
 
 fetcher = Fetcher(auto_match=False)
 
@@ -199,7 +200,7 @@ Fetchers are interfaces built on top of other libraries with added features that
 ### Features
 You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way
 ```python
-from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
+from scrapling.fetchers import Fetcher, StealthyFetcher, PlayWrightFetcher
 ```
 All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class.
 
@@ -231,7 +232,7 @@ You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods i
 ```
 For Async requests, you will just replace the import like below:
 ```python
->> from scrapling import AsyncFetcher
+>> from scrapling.fetchers import AsyncFetcher
 >> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
 >> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
 >> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
@@ -485,7 +486,7 @@ When website owners implement structural changes like
 The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
 
 ```python
-from scrapling import Adaptor
+from scrapling.parser import Adaptor
 # Before the change
 page = Adaptor(page_source, url='example.com')
 element = page.css('#p1' auto_save=True)
@@ -503,7 +504,7 @@ To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayba
 If I want to extract the Questions button from the old design I can use a selector like this `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a` This selector is too specific because it was generated by Google Chrome.
 Now let's test the same selector in both versions
 ```python
->> from scrapling import Fetcher
+>> from scrapling.fetchers import Fetcher
 >> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
 >> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
 >> new_url = "https://stackoverflow.com/"
@@ -564,7 +565,7 @@ Note: The filtering process always starts from the first filter it finds in the
 Examples to clear any confusion :)
 
 ```python
->> from scrapling import Fetcher
+>> from scrapling.fetchers import Fetcher
 >> page = Fetcher().get('https://quotes.toscrape.com/')
 # Find all elements with tag name `div`.
 >> page.find_all('div')
````
````diff
--- /dev/null
+++ scrapling-0.2.98/scrapling/__init__.py
@@ -0,0 +1,41 @@
+
+__author__ = "Karim Shoair (karim.shoair@pm.me)"
+__version__ = "0.2.98"
+__copyright__ = "Copyright (c) 2024 Karim Shoair"
+
+
+# A lightweight approach to create lazy loader for each import for backward compatibility
+# This will reduces initial memory footprint significantly (only loads what's used)
+def __getattr__(name):
+    if name == 'Fetcher':
+        from scrapling.fetchers import Fetcher as cls
+        return cls
+    elif name == 'Adaptor':
+        from scrapling.parser import Adaptor as cls
+        return cls
+    elif name == 'Adaptors':
+        from scrapling.parser import Adaptors as cls
+        return cls
+    elif name == 'AttributesHandler':
+        from scrapling.core.custom_types import AttributesHandler as cls
+        return cls
+    elif name == 'TextHandler':
+        from scrapling.core.custom_types import TextHandler as cls
+        return cls
+    elif name == 'AsyncFetcher':
+        from scrapling.fetchers import AsyncFetcher as cls
+        return cls
+    elif name == 'StealthyFetcher':
+        from scrapling.fetchers import StealthyFetcher as cls
+        return cls
+    elif name == 'PlayWrightFetcher':
+        from scrapling.fetchers import PlayWrightFetcher as cls
+        return cls
+    elif name == 'CustomFetcher':
+        from scrapling.fetchers import CustomFetcher as cls
+        return cls
+    else:
+        raise AttributeError(f"module 'scrapling' has no attribute '{name}'")
+
+
+__all__ = ['Adaptor', 'Fetcher', 'AsyncFetcher', 'StealthyFetcher', 'PlayWrightFetcher']
````
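The new package root relies on module-level `__getattr__` (PEP 562, Python 3.7+): a name is imported only the first time someone asks for it. Below is a standalone sketch of the same pattern; the module and attribute names (`heavy_module`, `HeavyThing`) are hypothetical:

```python
# lazy_pkg.py — a minimal PEP 562 lazy loader, same shape as scrapling/__init__.py
import importlib

_LAZY_NAMES = {
    # attribute name -> module that actually defines it (hypothetical names)
    'HeavyThing': 'heavy_module',
}


def __getattr__(name):
    # Called only when `name` is not found normally, so the heavy import is
    # deferred until first access instead of running at package import time.
    if name in _LAZY_NAMES:
        module = importlib.import_module(_LAZY_NAMES[name])
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```

With this in place, `import lazy_pkg` stays cheap; the cost of importing `heavy_module` is paid only on the first `lazy_pkg.HeavyThing` lookup.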
````diff
--- scrapling-0.2.96/scrapling/core/custom_types.py
+++ scrapling-0.2.98/scrapling/core/custom_types.py
@@ -19,9 +19,7 @@ class TextHandler(str):
     __slots__ = ()
 
     def __new__(cls, string):
-        if type(string) is str:
-            return super().__new__(cls, string)
-        return super().__new__(cls, '')
+        return super().__new__(cls, str(string))
 
     def __getitem__(self, key: Union[SupportsIndex, slice]) -> "TextHandler":
         lst = super().__getitem__(key)
````
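The `TextHandler.__new__` change replaces a strict type check with `str()` coercion. A small sketch of the behavioral difference; the pre-change branch is reconstructed from the diff context above, so treat the `Old` variant as an approximation:

```python
class OldTextHandler(str):
    def __new__(cls, string):
        # Pre-0.2.98 behavior (approximate): non-str input collapsed to ''
        if type(string) is str:
            return super().__new__(cls, string)
        return super().__new__(cls, '')


class NewTextHandler(str):
    def __new__(cls, string):
        # 0.2.98 behavior: any input is coerced with str()
        return super().__new__(cls, str(string))


print(repr(OldTextHandler(123)))  # ''
print(repr(NewTextHandler(123)))  # '123'
```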
````diff
--- scrapling-0.2.96/scrapling/core/storage_adaptors.py
+++ scrapling-0.2.98/scrapling/core/storage_adaptors.py
@@ -19,7 +19,7 @@ class StorageSystemMixin(ABC):
         """
         self.url = url
 
-    @lru_cache(
+    @lru_cache(64, typed=True)
     def _get_base_url(self, default_value: str = 'default') -> str:
         if not self.url or type(self.url) is not str:
             return default_value
@@ -51,7 +51,7 @@ class StorageSystemMixin(ABC):
         raise NotImplementedError('Storage system must implement `save` method')
 
     @staticmethod
-    @lru_cache(
+    @lru_cache(128, typed=True)
     def _get_hash(identifier: str) -> str:
         """If you want to hash identifier in your storage system, use this safer"""
         identifier = identifier.lower().strip()
@@ -63,7 +63,7 @@ class StorageSystemMixin(ABC):
         return f"{hash_value}_{len(identifier)}"  # Length to reduce collision chance
 
 
-@lru_cache(
+@lru_cache(1, typed=True)
 class SQLiteStorageSystem(StorageSystemMixin):
     """The recommended system to use, it's race condition safe and thread safe.
     Mainly built so the library can run in threaded frameworks like scrapy or threaded tools
````
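All three `storage_adaptors.py` hunks pin an explicit `maxsize` and `typed=True` on `functools.lru_cache` (the pre-change argument lists are truncated in this diff view). A quick sketch of what those two arguments do:

```python
from functools import lru_cache


@lru_cache(maxsize=2, typed=True)
def square(x):
    print(f"computing {x!r}")
    return x * x


square(2)    # computed
square(2)    # cache hit
square(2.0)  # computed again: typed=True keys int 2 and float 2.0 separately
square(3)
square(4)    # maxsize=2, so the least recently used entry gets evicted
print(square.cache_info())
```

Note that the last hunk decorates the `SQLiteStorageSystem` class itself, which memoizes construction: with `maxsize=1, typed=True`, repeated instantiation with the same arguments returns the one cached instance.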
````diff
--- scrapling-0.2.96/scrapling/core/translator.py
+++ scrapling-0.2.98/scrapling/core/translator.py
@@ -139,6 +139,9 @@ class TranslatorMixin:
 
 
 class HTMLTranslator(TranslatorMixin, OriginalHTMLTranslator):
-    @lru_cache(maxsize=
+    @lru_cache(maxsize=256)
     def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
         return super().css_to_xpath(css, prefix)
+
+
+translator_instance = HTMLTranslator()
````
````diff
--- /dev/null
+++ scrapling-0.2.98/scrapling/defaults.py
@@ -0,0 +1,19 @@
+# If you are going to use Fetchers with the default settings, import them from this file instead for a cleaner looking code
+
+# A lightweight approach to create lazy loader for each import for backward compatibility
+# This will reduces initial memory footprint significantly (only loads what's used)
+def __getattr__(name):
+    if name == 'Fetcher':
+        from scrapling.fetchers import Fetcher as cls
+        return cls()
+    elif name == 'AsyncFetcher':
+        from scrapling.fetchers import AsyncFetcher as cls
+        return cls()
+    elif name == 'StealthyFetcher':
+        from scrapling.fetchers import StealthyFetcher as cls
+        return cls()
+    elif name == 'PlayWrightFetcher':
+        from scrapling.fetchers import PlayWrightFetcher as cls
+        return cls()
+    else:
+        raise AttributeError(f"module 'scrapling' has no attribute '{name}'")
````