scrapling 0.2.92__tar.gz → 0.2.93__tar.gz
- {scrapling-0.2.92/scrapling.egg-info → scrapling-0.2.93}/PKG-INFO +58 -32
- {scrapling-0.2.92 → scrapling-0.2.93}/README.md +42 -26
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/__init__.py +1 -1
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/_types.py +2 -1
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/custom_types.py +91 -39
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/translator.py +1 -1
- scrapling-0.2.93/scrapling/defaults.py +10 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/camo.py +6 -2
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/pw.py +1 -1
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/fetchers.py +5 -5
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/parser.py +153 -189
- {scrapling-0.2.92 → scrapling-0.2.93/scrapling.egg-info}/PKG-INFO +58 -32
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling.egg-info/requires.txt +2 -3
- {scrapling-0.2.92 → scrapling-0.2.93}/setup.cfg +1 -1
- {scrapling-0.2.92 → scrapling-0.2.93}/setup.py +4 -5
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/async/test_playwright.py +1 -1
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/sync/test_playwright.py +1 -1
- scrapling-0.2.92/scrapling/defaults.py +0 -7
- {scrapling-0.2.92 → scrapling-0.2.93}/LICENSE +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/MANIFEST.in +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/cli.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/mixins.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/storage_adaptors.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/utils.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/constants.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/static.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/bypasses/pdf_viewer.js +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/custom.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/fingerprints.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/engines/toolbelt/navigation.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling/py.typed +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling.egg-info/SOURCES.txt +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling.egg-info/entry_points.txt +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/scrapling.egg-info/top_level.txt +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/async/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/async/test_camoufox.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/async/test_httpx.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/sync/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/sync/test_camoufox.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/sync/test_httpx.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/fetchers/test_utils.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/parser/__init__.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/parser/test_automatch.py +0 -0
- {scrapling-0.2.92 → scrapling-0.2.93}/tests/parser/test_general.py +0 -0
{scrapling-0.2.92/scrapling.egg-info → scrapling-0.2.93}/PKG-INFO

````diff
@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.2
 Name: scrapling
-Version: 0.2.92
+Version: 0.2.93
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
@@ -10,7 +10,7 @@ Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
 Project-URL: Source, https://github.com/D4Vinci/Scrapling
 Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
 Classifier: Operating System :: OS Independent
-Classifier: Development Status :: 4 - Beta
+Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved :: BSD License
 Classifier: Natural Language :: English
@@ -31,8 +31,7 @@ Classifier: Typing :: Typed
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: 
-Requires-Dist: lxml>=4.5
+Requires-Dist: lxml>=5.0
 Requires-Dist: cssselect>=1.2
 Requires-Dist: click
 Requires-Dist: w3lib
@@ -41,7 +40,18 @@ Requires-Dist: tldextract
 Requires-Dist: httpx[brotli,socks,zstd]
 Requires-Dist: playwright>=1.49.1
 Requires-Dist: rebrowser-playwright>=1.49.1
-Requires-Dist: camoufox[geoip]>=0.4.
+Requires-Dist: camoufox[geoip]>=0.4.10
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: license
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
 
 # 🕷️ Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
 [](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)
@@ -78,6 +88,21 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 [](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling)
 ---
 
+[Scrapeless](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci) is your all-in-one web scraping toolkit, starting at just $0.60 per 1k URLs!
+
+- 🚀 Scraping API: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
+- ⚡ Scraping Browser: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
+- 🔒 Web Unlocker: Bypass CAPTCHAs, IP blocks, and dynamic content in real time, ensuring uninterrupted access.
+- 🌐 Proxies: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
+- 💼 Enterprise-Grade: Custom solutions for large-scale and complex data needs.
+- 🎁 Free Trial: Try before you buy—experience our service firsthand.
+- 💬 Pay-Per-Use: Flexible, cost-effective pricing with no long-term commitments.
+- 🔧 Easy Integration: Seamlessly integrate with your existing tools and workflows for hassle-free automation.
+
+
+[](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci)
+---
+
 ## Table of content
 * [Key Features](#key-features)
 * [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support)
@@ -122,27 +147,27 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 ## Key Features
 
 ### Fetch websites as you prefer with async support
-- **HTTP
-- **
-- **
+- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class.
+- **Dynamic Loading & Automation**: Fetch dynamic websites with the `PlayWrightFetcher` class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless!
+- **Anti-bot Protections Bypass**: Easily bypass protections with `StealthyFetcher` and `PlayWrightFetcher` classes.
 
 ### Adaptive Scraping
-- 🔄 **Smart Element Tracking**:
-- 🎯 **Flexible
-- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you
+- 🔄 **Smart Element Tracking**: Relocate elements after website changes, using an intelligent similarity system and integrated storage.
+- 🎯 **Flexible Selection**: CSS selectors, XPath selectors, filters-based search, text search, regex search and more.
+- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you found!
 - 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors using Scrapling powerful features.
 
-### Performance
-- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries
+### High Performance
+- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
 - 🔋 **Memory Efficient**: Optimized data structures for minimal memory footprint.
-- ⚡ **Fast JSON serialization**: 10x faster
+- ⚡ **Fast JSON serialization**: 10x faster than standard library.
 
-### 
-- 🛠️ **Powerful Navigation API**:
-- 🧬 **Rich Text Processing**: All strings have built-in
-- 📝 **
-- 🔌 **API Similar to Scrapy/BeautifulSoup
-- 📘 **Type hints
+### Developer Friendly
+- 🛠️ **Powerful Navigation API**: Easy DOM traversal in all directions.
+- 🧬 **Rich Text Processing**: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that takes less memory than standard dictionaries with added methods.
+- 📝 **Auto Selectors Generation**: Generate robust short and full CSS/XPath selectors for any element.
+- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup and the same pseudo-elements used in Scrapy.
+- 📘 **Type hints**: Complete type/doc-strings coverage for future-proofing and best autocompletion support.
 
 ## Getting Started
 
@@ -151,21 +176,22 @@ from scrapling import Fetcher
 
 fetcher = Fetcher(auto_match=False)
 
-# 
+# Do http GET request to a web page and create an Adaptor instance
 page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
-# Get all
+# Get all text content from all HTML tags in the page except `script` and `style` tags
 page.get_all_text(ignore_tags=('script', 'style'))
 
-# Get all quotes, any of these methods will return a list of strings (TextHandlers)
+# Get all quotes elements, any of these methods will return a list of strings directly (TextHandlers)
 quotes = page.css('.quote .text::text') # CSS selector
 quotes = page.xpath('//span[@class="text"]/text()') # XPath
 quotes = page.css('.quote').css('.text::text') # Chained selectors
 quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above
 
 # Get the first quote element
-quote = page.css_first('.quote') # 
+quote = page.css_first('.quote') # same as page.css('.quote').first or page.css('.quote')[0]
 
 # Tired of selectors? Use find_all/find
+# Get all 'div' HTML tags that one of its 'class' values is 'quote'
 quotes = page.find_all('div', {'class': 'quote'})
 # Same as
 quotes = page.find_all('div', class_='quote')
@@ -173,10 +199,10 @@ quotes = page.find_all(['div'], class_='quote')
 quotes = page.find_all(class_='quote') # and so on...
 
 # Working with elements
-quote.html_content # Inner HTML
-quote.prettify() # Prettified version of Inner HTML
-quote.attrib # 
-quote.path # DOM path to element (List)
+quote.html_content # Get Inner HTML of this element
+quote.prettify() # Prettified version of Inner HTML above
+quote.attrib # Get that element's attributes
+quote.path # DOM path to element (List of all ancestors from <html> tag till the element itself)
 ```
 To keep it simple, all methods can be chained on top of each other!
 
@@ -292,7 +318,7 @@ True
 | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
 | allow_webgl | Enabled by default. Disabling it WebGL not recommended as many WAFs now checks if WebGL is enabled. | ✔️ |
 | geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
-| disable_ads | 
+| disable_ads | Disabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
 | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
@@ -574,7 +600,7 @@ Inspired by BeautifulSoup's `find_all` function you can find elements by using `
 * Any string passed is considered a tag name
 * Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
 * Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
-* Any regex patterns passed are used as filters
+* Any regex patterns passed are used as filters to elements by their text content
 * Any functions passed are used as filters
 * Any keyword argument passed is considered as an HTML element attribute with its value.
 
@@ -583,7 +609,7 @@ So the way it works is after collecting all passed arguments and keywords, each
 
 1. All elements with the passed tag name(s).
 2. All elements that match all passed attribute(s).
-3. All elements that match all passed regex patterns.
+3. All elements that its text content match all passed regex patterns.
 4. All elements that fulfill all passed function(s).
 
 Note: The filtering process always starts from the first filter it finds in the filtering order above so if no tag name(s) are passed but attributes are passed, the process starts from that layer and so on. **But the order in which you pass the arguments doesn't matter.**
````
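To make the `find_all` filtering order above concrete, here is a small usage sketch based only on the rules just described. The URL and patterns are illustrative, and the lambda assumes filter functions receive the candidate element, per the function-filter rule:

```python
import re

from scrapling import Fetcher

fetcher = Fetcher(auto_match=False)
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)

# Layer 3 of the filtering order: a compiled regex filters elements
# by their text content
love_quotes = page.find_all('span', re.compile(r'\blove\b'))

# Layers combine, and the order you pass arguments in doesn't matter:
# tag name (layer 1) + attribute (layer 2) + function filter (layer 4)
short_quotes = page.find_all('span', lambda element: len(element.text) < 100, class_='text')
```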
{scrapling-0.2.92 → scrapling-0.2.93}/README.md

````diff
@@ -33,6 +33,21 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 [](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling)
 ---
 
+[Scrapeless](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci) is your all-in-one web scraping toolkit, starting at just $0.60 per 1k URLs!
+
+- 🚀 Scraping API: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
+- ⚡ Scraping Browser: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
+- 🔒 Web Unlocker: Bypass CAPTCHAs, IP blocks, and dynamic content in real time, ensuring uninterrupted access.
+- 🌐 Proxies: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
+- 💼 Enterprise-Grade: Custom solutions for large-scale and complex data needs.
+- 🎁 Free Trial: Try before you buy—experience our service firsthand.
+- 💬 Pay-Per-Use: Flexible, cost-effective pricing with no long-term commitments.
+- 🔧 Easy Integration: Seamlessly integrate with your existing tools and workflows for hassle-free automation.
+
+
+[](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci)
+---
+
 ## Table of content
 * [Key Features](#key-features)
 * [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support)
@@ -77,27 +92,27 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 ## Key Features
 
 ### Fetch websites as you prefer with async support
-- **HTTP
-- **
-- **
+- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class.
+- **Dynamic Loading & Automation**: Fetch dynamic websites with the `PlayWrightFetcher` class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless!
+- **Anti-bot Protections Bypass**: Easily bypass protections with `StealthyFetcher` and `PlayWrightFetcher` classes.
 
 ### Adaptive Scraping
-- 🔄 **Smart Element Tracking**:
-- 🎯 **Flexible
-- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you
+- 🔄 **Smart Element Tracking**: Relocate elements after website changes, using an intelligent similarity system and integrated storage.
+- 🎯 **Flexible Selection**: CSS selectors, XPath selectors, filters-based search, text search, regex search and more.
+- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you found!
 - 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors using Scrapling powerful features.
 
-### Performance
-- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries
+### High Performance
+- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
 - 🔋 **Memory Efficient**: Optimized data structures for minimal memory footprint.
-- ⚡ **Fast JSON serialization**: 10x faster
+- ⚡ **Fast JSON serialization**: 10x faster than standard library.
 
-### 
-- 🛠️ **Powerful Navigation API**:
-- 🧬 **Rich Text Processing**: All strings have built-in
-- 📝 **
-- 🔌 **API Similar to Scrapy/BeautifulSoup
-- 📘 **Type hints
+### Developer Friendly
+- 🛠️ **Powerful Navigation API**: Easy DOM traversal in all directions.
+- 🧬 **Rich Text Processing**: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that takes less memory than standard dictionaries with added methods.
+- 📝 **Auto Selectors Generation**: Generate robust short and full CSS/XPath selectors for any element.
+- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup and the same pseudo-elements used in Scrapy.
+- 📘 **Type hints**: Complete type/doc-strings coverage for future-proofing and best autocompletion support.
 
 ## Getting Started
 
@@ -106,21 +121,22 @@ from scrapling import Fetcher
 
 fetcher = Fetcher(auto_match=False)
 
-# 
+# Do http GET request to a web page and create an Adaptor instance
 page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
-# Get all
+# Get all text content from all HTML tags in the page except `script` and `style` tags
 page.get_all_text(ignore_tags=('script', 'style'))
 
-# Get all quotes, any of these methods will return a list of strings (TextHandlers)
+# Get all quotes elements, any of these methods will return a list of strings directly (TextHandlers)
 quotes = page.css('.quote .text::text') # CSS selector
 quotes = page.xpath('//span[@class="text"]/text()') # XPath
 quotes = page.css('.quote').css('.text::text') # Chained selectors
 quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above
 
 # Get the first quote element
-quote = page.css_first('.quote') # 
+quote = page.css_first('.quote') # same as page.css('.quote').first or page.css('.quote')[0]
 
 # Tired of selectors? Use find_all/find
+# Get all 'div' HTML tags that one of its 'class' values is 'quote'
 quotes = page.find_all('div', {'class': 'quote'})
 # Same as
 quotes = page.find_all('div', class_='quote')
@@ -128,10 +144,10 @@ quotes = page.find_all(['div'], class_='quote')
 quotes = page.find_all(class_='quote') # and so on...
 
 # Working with elements
-quote.html_content # Inner HTML
-quote.prettify() # Prettified version of Inner HTML
-quote.attrib # 
-quote.path # DOM path to element (List)
+quote.html_content # Get Inner HTML of this element
+quote.prettify() # Prettified version of Inner HTML above
+quote.attrib # Get that element's attributes
+quote.path # DOM path to element (List of all ancestors from <html> tag till the element itself)
 ```
 To keep it simple, all methods can be chained on top of each other!
 
@@ -247,7 +263,7 @@ True
 | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
 | allow_webgl | Enabled by default. Disabling it WebGL not recommended as many WAFs now checks if WebGL is enabled. | ✔️ |
 | geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
-| disable_ads | 
+| disable_ads | Disabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
 | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
@@ -529,7 +545,7 @@ Inspired by BeautifulSoup's `find_all` function you can find elements by using `
 * Any string passed is considered a tag name
 * Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
 * Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
-* Any regex patterns passed are used as filters
+* Any regex patterns passed are used as filters to elements by their text content
 * Any functions passed are used as filters
 * Any keyword argument passed is considered as an HTML element attribute with its value.
 
@@ -538,7 +554,7 @@ So the way it works is after collecting all passed arguments and keywords, each
 
 1. All elements with the passed tag name(s).
 2. All elements that match all passed attribute(s).
-3. All elements that match all passed regex patterns.
+3. All elements that its text content match all passed regex patterns.
 4. All elements that fulfill all passed function(s).
 
 Note: The filtering process always starts from the first filter it finds in the filtering order above so if no tag name(s) are passed but attributes are passed, the process starts from that layer and so on. **But the order in which you pass the arguments doesn't matter.**
````
{scrapling-0.2.92 → scrapling-0.2.93}/scrapling/__init__.py

```diff
@@ -5,7 +5,7 @@ from scrapling.fetchers import (AsyncFetcher, CustomFetcher, Fetcher,
 from scrapling.parser import Adaptor, Adaptors
 
 __author__ = "Karim Shoair (karim.shoair@pm.me)"
-__version__ = "0.2.92"
+__version__ = "0.2.93"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"
 
 
```
{scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/_types.py

```diff
@@ -3,7 +3,8 @@ Type definitions for type checking purposes.
 """
 
 from typing import (TYPE_CHECKING, Any, Callable, Dict, Generator, Iterable,
-                    List, Literal, Optional, Pattern, Tuple, Type,
+                    List, Literal, Optional, Pattern, Tuple, Type, TypeVar,
+                    Union)
 
 SelectorWaitStates = Literal["attached", "detached", "hidden", "visible"]
 
```
{scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/custom_types.py

```diff
@@ -1,13 +1,18 @@
 import re
+import typing
 from collections.abc import Mapping
 from types import MappingProxyType
 
 from orjson import dumps, loads
 from w3lib.html import replace_entities as _replace_entities
 
-from scrapling.core._types import Dict, List, 
+from scrapling.core._types import (Dict, Iterable, List, Literal, Optional,
+                                   Pattern, SupportsIndex, TypeVar, Union)
 from scrapling.core.utils import _is_iterable, flatten
 
+# Define type variable for AttributeHandler value type
+_TextHandlerType = TypeVar('_TextHandlerType', bound='TextHandler')
+
 
 class TextHandler(str):
     """Extends standard Python string by adding more functionality"""
@@ -18,72 +23,89 @@ class TextHandler(str):
             return super().__new__(cls, string)
         return super().__new__(cls, '')
 
-
-
-
-    def strip(self, chars=None):
+    @typing.overload
+    def __getitem__(self, key: SupportsIndex) -> 'TextHandler':
+        pass
+
+    @typing.overload
+    def __getitem__(self, key: slice) -> "TextHandlers":
+        pass
+
+    def __getitem__(self, key: Union[SupportsIndex, slice]) -> Union["TextHandler", "TextHandlers"]:
+        lst = super().__getitem__(key)
+        if isinstance(key, slice):
+            lst = [TextHandler(s) for s in lst]
+            return TextHandlers(typing.cast(List[_TextHandlerType], lst))
+        return typing.cast(_TextHandlerType, TextHandler(lst))
+
+    def split(self, sep: str = None, maxsplit: SupportsIndex = -1) -> 'TextHandlers':
+        return TextHandlers(
+            typing.cast(List[_TextHandlerType], [TextHandler(s) for s in super().split(sep, maxsplit)])
+        )
+
+    def strip(self, chars: str = None) -> Union[str, 'TextHandler']:
         return TextHandler(super().strip(chars))
 
-    def lstrip(self, chars=None):
+    def lstrip(self, chars: str = None) -> Union[str, 'TextHandler']:
         return TextHandler(super().lstrip(chars))
 
-    def rstrip(self, chars=None):
+    def rstrip(self, chars: str = None) -> Union[str, 'TextHandler']:
         return TextHandler(super().rstrip(chars))
 
-    def capitalize(self):
+    def capitalize(self) -> Union[str, 'TextHandler']:
         return TextHandler(super().capitalize())
 
-    def casefold(self):
+    def casefold(self) -> Union[str, 'TextHandler']:
         return TextHandler(super().casefold())
 
-    def center(self, width, fillchar=' '):
+    def center(self, width: SupportsIndex, fillchar: str = ' ') -> Union[str, 'TextHandler']:
         return TextHandler(super().center(width, fillchar))
 
-    def expandtabs(self, tabsize=8):
+    def expandtabs(self, tabsize: SupportsIndex = 8) -> Union[str, 'TextHandler']:
         return TextHandler(super().expandtabs(tabsize))
 
-    def format(self, *args, **kwargs):
+    def format(self, *args: str, **kwargs: str) -> Union[str, 'TextHandler']:
         return TextHandler(super().format(*args, **kwargs))
 
-    def format_map(self, mapping):
+    def format_map(self, mapping) -> Union[str, 'TextHandler']:
         return TextHandler(super().format_map(mapping))
 
-    def join(self, iterable):
+    def join(self, iterable: Iterable[str]) -> Union[str, 'TextHandler']:
         return TextHandler(super().join(iterable))
 
-    def ljust(self, width, fillchar=' '):
+    def ljust(self, width: SupportsIndex, fillchar: str = ' ') -> Union[str, 'TextHandler']:
         return TextHandler(super().ljust(width, fillchar))
 
-    def rjust(self, width, fillchar=' '):
+    def rjust(self, width: SupportsIndex, fillchar: str = ' ') -> Union[str, 'TextHandler']:
         return TextHandler(super().rjust(width, fillchar))
 
-    def swapcase(self):
+    def swapcase(self) -> Union[str, 'TextHandler']:
         return TextHandler(super().swapcase())
 
-    def title(self):
+    def title(self) -> Union[str, 'TextHandler']:
         return TextHandler(super().title())
 
-    def translate(self, table):
+    def translate(self, table) -> Union[str, 'TextHandler']:
         return TextHandler(super().translate(table))
 
-    def zfill(self, width):
+    def zfill(self, width: SupportsIndex) -> Union[str, 'TextHandler']:
         return TextHandler(super().zfill(width))
 
-    def replace(self, old, new, count
+    def replace(self, old: str, new: str, count: SupportsIndex = -1) -> Union[str, 'TextHandler']:
         return TextHandler(super().replace(old, new, count))
 
-    def upper(self):
+    def upper(self) -> Union[str, 'TextHandler']:
         return TextHandler(super().upper())
 
-    def lower(self):
+    def lower(self) -> Union[str, 'TextHandler']:
         return TextHandler(super().lower())
     ##############
 
-    def sort(self, reverse: bool = False) -> str:
+    def sort(self, reverse: bool = False) -> Union[str, 'TextHandler']:
         """Return a sorted version of the string"""
         return self.__class__("".join(sorted(self, reverse=reverse)))
 
-    def clean(self) -> str:
+    def clean(self) -> Union[str, 'TextHandler']:
         """Return a new version of the string after removing all white spaces and consecutive spaces"""
         data = re.sub(r'[\t|\r|\n]', '', self)
         data = re.sub(' +', ' ', data)
@@ -105,10 +127,32 @@ class TextHandler(str):
         # Check this out: https://github.com/ijl/orjson/issues/445
         return loads(str(self))
 
+    @typing.overload
+    def re(
+        self,
+        regex: Union[str, Pattern[str]],
+        check_match: Literal[True],
+        replace_entities: bool = True,
+        clean_match: bool = False,
+        case_sensitive: bool = False,
+    ) -> bool:
+        ...
+
+    @typing.overload
+    def re(
+        self,
+        regex: Union[str, Pattern[str]],
+        replace_entities: bool = True,
+        clean_match: bool = False,
+        case_sensitive: bool = False,
+        check_match: Literal[False] = False,
+    ) -> "TextHandlers[TextHandler]":
+        ...
+
     def re(
         self, regex: Union[str, Pattern[str]], replace_entities: bool = True, clean_match: bool = False,
         case_sensitive: bool = False, check_match: bool = False
-    ) -> Union[
+    ) -> Union["TextHandlers[TextHandler]", bool]:
         """Apply the given regex to the current text and return a list of strings with the matches.
 
         :param regex: Can be either a compiled regular expression or a string.
@@ -133,12 +177,12 @@ class TextHandler(str):
         results = flatten(results)
 
         if not replace_entities:
-            return [TextHandler(string) for string in results]
+            return TextHandlers(typing.cast(List[_TextHandlerType], [TextHandler(string) for string in results]))
 
-        return [TextHandler(_replace_entities(s)) for s in results]
+        return TextHandlers(typing.cast(List[_TextHandlerType], [TextHandler(_replace_entities(s)) for s in results]))
 
     def re_first(self, regex: Union[str, Pattern[str]], default=None, replace_entities: bool = True,
-                 clean_match: bool = False, case_sensitive: bool = False) -> 
+                 clean_match: bool = False, case_sensitive: bool = False) -> "TextHandler":
         """Apply the given regex to text and return the first match if found, otherwise return the default value.
 
         :param regex: Can be either a compiled regular expression or a string.
@@ -158,15 +202,23 @@ class TextHandlers(List[TextHandler]):
     """
     __slots__ = ()
 
-    def __getitem__(self, pos):
+    @typing.overload
+    def __getitem__(self, pos: SupportsIndex) -> TextHandler:
+        pass
+
+    @typing.overload
+    def __getitem__(self, pos: slice) -> "TextHandlers":
+        pass
+
+    def __getitem__(self, pos: Union[SupportsIndex, slice]) -> Union[TextHandler, "TextHandlers"]:
         lst = super().__getitem__(pos)
         if isinstance(pos, slice):
-
-
-
+            lst = [TextHandler(s) for s in lst]
+            return TextHandlers(typing.cast(List[_TextHandlerType], lst))
+        return typing.cast(_TextHandlerType, TextHandler(lst))
 
     def re(self, regex: Union[str, Pattern[str]], replace_entities: bool = True, clean_match: bool = False,
-           case_sensitive: bool = False) -> '
+           case_sensitive: bool = False) -> 'TextHandlers[TextHandler]':
         """Call the ``.re()`` method for each element in this list and return
         their results flattened as TextHandlers.
 
@@ -178,10 +230,10 @@ class TextHandlers(List[TextHandler]):
         results = [
             n.re(regex, replace_entities, clean_match, case_sensitive) for n in self
         ]
-        return flatten(results)
+        return TextHandlers(flatten(results))
 
     def re_first(self, regex: Union[str, Pattern[str]], default=None, replace_entities: bool = True,
-                 clean_match: bool = False, case_sensitive: bool = False) -> 
+                 clean_match: bool = False, case_sensitive: bool = False) -> TextHandler:
         """Call the ``.re_first()`` method for each element in this list and return
         the first result or the default value otherwise.
 
@@ -210,7 +262,7 @@ class TextHandlers(List[TextHandler]):
     get_all = extract
 
 
-class AttributesHandler(Mapping):
+class AttributesHandler(Mapping[str, _TextHandlerType]):
     """A read-only mapping to use instead of the standard dictionary for the speed boost but at the same time I use it to add more functionalities.
     If standard dictionary is needed, just convert this class to dictionary with `dict` function
     """
@@ -231,7 +283,7 @@ class AttributesHandler(Mapping):
         # Fastest read-only mapping type
         self._data = MappingProxyType(mapping)
 
-    def get(self, key, default=None):
+    def get(self, key: str, default: Optional[str] = None) -> Union[_TextHandlerType, None]:
         """Acts like standard dictionary `.get()` method"""
         return self._data.get(key, default)
 
@@ -253,7 +305,7 @@ class AttributesHandler(Mapping):
         """Convert current attributes to JSON string if the attributes are JSON serializable otherwise throws error"""
         return dumps(dict(self._data))
 
-    def __getitem__(self, key):
+    def __getitem__(self, key: str) -> _TextHandlerType:
         return self._data[key]
 
     def __iter__(self):
```
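The net effect of these `custom_types.py` changes is that the wrapper types now survive chaining: `.re()`, `.split()`, and indexing hand back `TextHandler`/`TextHandlers` instead of degrading to plain `str`/`list`. A minimal sketch of the behavior the new overloads describe (results in comments are illustrative):

```python
from scrapling.core.custom_types import TextHandler

text = TextHandler('Prices: $15.99 and $20.00')

# .re() now returns TextHandlers rather than a plain list,
# so the result can be chained with .re_first(), .re(), etc.
prices = text.re(r'\$\d+\.\d+')    # e.g. TextHandlers of '$15.99', '$20.00'
cents = prices.re_first(r'\.\d+')  # a TextHandler, e.g. '.99'

# With check_match=True the first overload applies and a bool comes back
has_price = text.re(r'\$\d+', check_match=True)  # True

# Wrapped str methods keep the type too
words = text.split()  # TextHandlers
loud = text.upper()   # TextHandler
```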
{scrapling-0.2.92 → scrapling-0.2.93}/scrapling/core/translator.py

```diff
@@ -139,6 +139,6 @@ class TranslatorMixin:
 
 
 class HTMLTranslator(TranslatorMixin, OriginalHTMLTranslator):
-    @lru_cache(maxsize=
+    @lru_cache(maxsize=2048)
     def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
         return super().css_to_xpath(css, prefix)
```
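The only change here is a larger cache for the CSS-to-XPath translation. Translating a selector is a pure function of the selector string, which is what makes memoization safe; a standalone sketch of the same idea using `cssselect` directly:

```python
from functools import lru_cache

from cssselect import HTMLTranslator

_translator = HTMLTranslator()


@lru_cache(maxsize=2048)
def css_to_xpath(css: str) -> str:
    # Deterministic translation: repeated selectors cost one cache lookup
    return _translator.css_to_xpath(css)


print(css_to_xpath('.quote .text'))  # the XPath 1.0 equivalent of the selector
```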
scrapling-0.2.93/scrapling/defaults.py (new file)

```diff
@@ -0,0 +1,10 @@
+from .fetchers import AsyncFetcher as _AsyncFetcher
+from .fetchers import Fetcher as _Fetcher
+from .fetchers import PlayWrightFetcher as _PlayWrightFetcher
+from .fetchers import StealthyFetcher as _StealthyFetcher
+
+# If you are going to use Fetchers with the default settings, import them from this file instead for a cleaner looking code
+Fetcher = _Fetcher()
+AsyncFetcher = _AsyncFetcher()
+StealthyFetcher = _StealthyFetcher()
+PlayWrightFetcher = _PlayWrightFetcher()
```